Chapter 1. Training Data Introduction


A NOTE FOR EARLY RELEASE READERS 
With Early Release ebooks, you get books in their earliest form (the
author’s raw and unedited content as they write) so you can take advantage
of these technologies long before the official release of these titles.

This will be the 1st chapter of the final book. Please note that the GitHub
repo will be made active later on.

If you have comments about how we might improve the content and/or
examples in this book, or if you notice missing material within this chapter,
please reach out to the editor at jleonard@oreilly.com.


Data is all around us: videos, images, text, 3D, geospatial, documents, and
more. Yet in its raw form this data is of little use to machine learning (ML).
How do we make use of this data? How do we record our intelligence so it
can be reproduced through ML? The answer is the Art of Training Data:
the discipline of making raw data useful.
In this book you will learn: 


• All-new Training Data specific concepts
• The Day-to-Day Practice of Training Data
• How to improve Training Data efficiency
• Real world case studies
• How to transform your team to be more AI/ML centric


Before we can cover some of these concepts, we first have to understand
the foundations, which this chapter will unpack.


Training Data is about molding, reforming, shaping, and digesting raw data
into new forms: creating new meaning out of raw data to solve problems.
These acts of creation and destruction sit at the intersection of subject
matter expertise, business needs, and technical requirements. It’s a diverse
set of activities that crosscut multiple domains.


At the heart of these activities is annotation. Annotation produces structured
data that is ready to be consumed by a machine learning model. Without
annotation, raw data is considered to be unstructured and not usable. That’s
why training data is required for modern machine learning use cases
including computer vision, natural language processing, and speech
recognition.


To cement this idea in an example, let’s consider annotation in detail. When
we annotate data, we are capturing human knowledge. Typically, this
process looks as follows: a piece of media such as an image, text, video,
3D, or audio is presented along with a set of predefined options (labels). A
human reviews the media and determines the most appropriate answers, for
example by declaring a region of an image to be “good” or “bad”. This label
provides the context needed to apply machine learning concepts (Figure 1-1).
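The annotation step described above can be sketched in a few lines of
Python. This is a minimal illustration rather than the API of any particular
tool: the class name `AnnotationTask`, its fields, and the file path are all
hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class AnnotationTask:
    """One piece of media shown to a human with a fixed set of options."""

    media_path: str              # hypothetical path to an image, text, audio, etc.
    options: Tuple[str, ...]     # the predefined labels
    answer: Optional[str] = None

    def annotate(self, label: str) -> None:
        # Only answers drawn from the predefined options are accepted.
        if label not in self.options:
            raise ValueError(f"{label!r} is not one of {self.options}")
        self.answer = label


task = AnnotationTask(media_path="images/part_0001.jpg", options=("good", "bad"))
task.annotate("good")   # the human's judgment is now structured data
print(task.answer)
```

The point of the sketch is that the human’s answer is constrained by the
predefined options, which is what makes the result structured.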


But how did we get there? How did we get to the point that the right media
element, with the right predefined set of options, is shown to the right
person at the right time? There are many concepts that lead up to and follow
the moment where that annotation, or knowledge capture, actually happens.
Collectively, all of these concepts are the art of training data.


Figure 1-1. The Training Data Process (panels: Raw Data; Human Review;
Structured Data Ready for Model Consumption)


In this chapter, we’ll introduce what training data is, why it matters, and
dive into many key concepts that will form the base for the rest of the book.


Training Data Intents 


What can you do with Training Data? What is it most concerned with? What
are people aiming to achieve with Training Data? The purpose of Training
Data varies across different use cases, problems, and scenarios. Let’s
explore some of the most common questions.


What Can You Do With Training Data? 


Training Data is the foundation of AI/ML systems: the underpinning that
makes these systems work. With Training Data, you can build and maintain
modern ML systems, such as ones that create next generation automations,
improve existing products, and even create all new products.


In order to be useful, the data needs to be presented in a structured way to
ML programs. That’s where Training Data comes in: adding and
maintaining structure to make the raw data useful. If you have great
Training Data, you are on the path towards a great overall solution.

In practice, common use cases center around:

• Improving an existing product (e.g., performance), even if ML is not
currently a part of it
• Production of a new product, including systems that run in a limited or
“one off” fashion
• Research and Development


Training Data touches all parts of ML programs. It comes up before you can
run an ML program, during the run in terms of output and results, and later
in analysis and maintenance. Further, Training Data concerns tend to be
long lived. For example, after getting a model up and running, maintaining
the Training Data is an important part of maintaining the model.


The creation and maintenance of novel data is a primary concern of this
book. A dataset at a moment in time is an output of the complex processes
of Training Data. For example, a Train/Test/Val split is a derivative of an
original, novel set. And that novel set itself is simply a snapshot, a single
view into larger Training Data processes. Similar to how a programmer may
decide to print or log a variable, the value printed is just the output; it
doesn’t explain the complex set of functions that were required to get the
desired value. A goal of this book is to explain the complex processes
behind getting usable datasets.
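To make the “derivative of a snapshot” idea concrete, here is a minimal
sketch of deriving a Train/Val/Test split from a list of sample IDs. The
function name `split_dataset`, the 80/10/10 proportions, and the sample IDs
are assumptions chosen for illustration only.

```python
import random


def split_dataset(sample_ids, train=0.8, val=0.1, seed=0):
    """Derive train/val/test subsets from one snapshot of sample IDs."""
    ids = sorted(sample_ids)              # fix the order so the split is repeatable
    random.Random(seed).shuffle(ids)      # shuffle with a local, seeded generator
    n_train = int(len(ids) * train)
    n_val = int(len(ids) * val)
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],    # whatever remains
    }


# The snapshot is itself only one view of a changing set of samples.
snapshot = [f"sample_{i:03d}" for i in range(100)]
splits = split_dataset(snapshot)
print({name: len(ids) for name, ids in splits.items()})
```

Rerunning the function on a later snapshot yields a different split, which
is why the split is better thought of as an output than as the dataset itself.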


Annotation, the act of humans directly annotating samples, is an important
part of Training Data. However, it is just one part, and the process of
Training Data involves many others, outlined later in this chapter.


What is Training Data Most Concerned With? 


This book covers a variety of people, organizational, and technical
concerns. We’ll walk through each of these concepts in detail in a moment,
but before we do, let’s think about areas Training Data is focused on.


For example, how does the Schema, which is a map between your
annotations and their meaning for your use case, accurately represent the
problem? How do you ensure raw data is collected and used in a way
relevant to the problem? How are human validation, monitoring, controls,
and correction applied?


How do you repeatedly achieve and maintain acceptable degrees of quality
when there is such a large human component? How does Training Data
integrate with other technologies, including data sources and your
application?


While not a perfect division, these concerns can be broadly organized into
the following topics: Schema, Raw Data, Quality, Integrations, and the Human
Role. Next, I’ll take a deeper look at each of those topics.

Schema


Schema is central in every aspect of Training Data. Schema is the map
between human input and meaning for your use case. It defines what the
ML program is capable of outputting. It’s the vital link that binds
together everyone’s hard work. So, to state the obvious, it’s important.


A good Schema is useful and relevant to your specific need. It’s usually
best to create a new, custom Schema, and then keep iterating on it for your
specific cases. It’s normal to draw on domain specific databases for
inspiration, or to fill in certain levels of detail, but be sure that’s done in
the context of guidance for a new, novel Schema. Don’t expect an existing
Schema from another context to work for ML programs without further
updates.


So, why is it important to design it according to your specific needs, and not
some predefined set?


First, the Schema is for both human annotation and ML use. An
existing domain specific schema may be designed for human use in a
different context or for machine use in a classic, non-ML context. This is
one of those cases where the output seems really similar, but is actually
formed in totally different ways, like two different math functions that
both output the same value but run on completely different logic. The
output of the Schema may appear similar, but the differences are important
to make it friendly to annotation and ML use.


Second, if the Schema is not useful, then even great model predictions are
not useful. Failure in Schema design will likely cascade to failure of the
overall system. The context here is that ML programs can usually only
predict what is included in the Schema. It’s rare that an ML program will
produce relevant results that are better than the original Schema. It’s also
rare that it will predict something that a human, or group of humans,
looking at the same raw data could not also predict.
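One way to see why a program can usually only predict what is in the
Schema is to look at how a typical classifier produces its answer. The
sketch below is a simplified illustration with a hypothetical two-label
Schema and made-up scores; it is not tied to any particular library.

```python
SCHEMA = ("good", "bad")   # hypothetical labels defined by the Schema


def predict(scores):
    """Return the Schema label whose score is highest.

    `scores` stands in for a model's raw output: one number per Schema label.
    """
    if len(scores) != len(SCHEMA):
        raise ValueError("one score per Schema label is expected")
    best_index = max(range(len(SCHEMA)), key=lambda i: scores[i])
    return SCHEMA[best_index]


print(predict([0.2, 0.8]))
```

Whatever the scores are, the return value is always one of the labels in
`SCHEMA`. A concept that was never put into the Schema cannot come out.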


It is common to see Schemas that have questionable value. So, it’s really
worth stopping and thinking, “If we automatically got the data labeled with
this Schema, would it actually be useful to us?” and “Can a human looking
at the raw data reasonably choose something from the Schema that fits it?”


In the first few chapters, we will cover the technical aspects of Schema, and
we will come back to Schema concerns through practical examples later in
the book.


Raw data 


When we think about raw data, the most important thing is that it’s
collected and used in a way relevant to the Schema.


To illustrate the idea of relevance, let’s consider the difference between
hearing a sports game on the radio, seeing it on TV, or being at the game in
person. It’s the same event regardless of the medium, but you receive a very
different amount of data in each context. The context of the collection
frames the potential of the data. So, for example, if you were trying to
determine possession of the ball automatically, the visual data will likely be
a better fit than the radio data.


Compared to software, we humans are good at automatically making
contextual correlations and working with noisy data. We make many
assumptions, often drawing on data sources not present in the moment to
our senses. This ability to understand the context beyond the directly sensed
sights, sounds, etc. makes it difficult to remember that software is more
limited here.


Software only has the context that is programmed into it, be it through data
or lines of code. This means the real challenge with raw data is overcoming
our human assumptions around context to make the right data available.


So how do you do that? One of the more successful ways is to start with the
Schema, and then map ideas of raw data collection to that. It can be
visualized as a chain of Problem -> Schema -> Raw Data. The need for a
Schema is always defined by the Problem or Product. That way there is
always this easy check of “Given the Schema, and the raw data, can a human
make a reasonable judgment?”


Centering around the Schema also encourages thinking about new ways of
data collection, instead of limiting yourself to existing or easiest to reach
data collection methods. Over time the Schema and Raw Data can be jointly
iterated on; this is just to get started. Another way to relate it, on the
product side, is that the Schema represents the Product. So, to use the
cliché of “Product Market Fit”, this is “Product Data Fit”.


To put the above abstractions into more concrete terms, here’s some of what
I see most commonly in industry:


Differences between data used during development and production are one
of the most common sources of errors. It is common because it is somewhat
unavoidable. That’s why being able to get to some level of “real” data early
in the iteration process is crucial. You have to expect that production data
will be different, and plan for it as part of your overall data collection
strategy.


The data program can only see the raw data and the annotations: only what
is given to it. If a human annotator is relying on knowledge outside of what
can be understood from the sample presented, it’s unlikely the data program
will have that context, and it will fail. We must remember that all needed
context must be present, either in the data or lines of code of the program.


To recap: 


• The raw data needs to be relevant to the Schema.
• The raw data should be as similar to production data as possible.
• The raw data should have all the context needed in the sample itself.


Quality 


Training Data quality is naturally a spectrum. What is acceptable in one
context may not be in another.


So, what are the biggest factors that go into Training Data quality? 


Well, we already talked about two of them: Schema and raw data. For
example:

• A bad Schema may cause more quality issues than bad annotators.
• If the concept is not clear in the raw data sample, it’s unlikely it will be
clear to the ML program.


Often, Annotation quality is the next biggest item. Annotation quality is
important, but perhaps not in the ways you may expect. Specifically,
people tend to think of Annotation quality as “was it annotated right?” But
“right” is often out of scope.


To understand how the “right” answer is often out of scope, let’s imagine
we are annotating traffic lights, and the light in the sample you are
presented with is off (e.g., power failure), while your only options from the
Schema are variations on an active traffic light. Clearly either the Schema
needs to be updated to include an “off” traffic light, or our production
system will never be usable in a context where a traffic light may have a
power failure.
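The traffic light case can be sketched as a small check that records an
issue whenever what the annotator sees has no matching option. The label
names, sample IDs, and the `annotate` helper below are hypothetical and
only illustrate the idea of surfacing Schema gaps.

```python
from collections import Counter

# Hypothetical Schema: only active light states, no "off" state.
SCHEMA = {"red", "yellow", "green"}

# Counts of observed states that the Schema could not express.
issues = Counter()


def annotate(sample_id, observed_state):
    """Return a label if the Schema covers it; otherwise record an issue."""
    if observed_state in SCHEMA:
        return observed_state
    issues[observed_state] += 1   # a signal that the Schema needs updating
    return None


annotate("img_001", "green")
annotate("img_002", "off")
annotate("img_003", "off")
print(issues.most_common())
```

A growing count for a state such as “off” points to a Schema problem
rather than to annotators labeling incorrectly.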


To move into a slightly harder to control case, consider a traffic light that
is really far away or at an odd angle; that will also limit the ability to
annotate it properly. Often these cases sound like they should be easily
manageable, but in practice they often aren’t. So more generally, real issues
with annotation quality tend to circle back to issues with the Schema and
Raw Data. So, quality with Annotation is more about communication of
issues, of annotators surfacing problems in Schema and Data, than just about
annotating “correctly”.


Hopefully this has hammered home the point that Schema and raw data
deserve a lot of attention. However, annotating correctly does still matter,
and one of the approaches is to have multiple people look at the same
sample. This is often costly. And someone must interpret the meaning of the
multiple opinions on the same sample, adding further cost. For an industry
usable case, where the Schema has a reasonable degree of complexity, the
meta-analysis of the opinions is a further time sink.


Think of a crowd of people watching a sports game instant replay. Imagine
trying to statistically sample their opinions to get a “proof” of what is “more
right”. Instead of this, we have a referee who individually reviews the
situation and makes a determination. That determination may or may not be
“right”, but it’s more cost effective than trying to survey the crowd, and
realistically works better.


Similarly, often a more cost-effective approach is to randomly sample a
percentage of the data for a review loop, and have annotators raise issues
with the Schema and raw data fit as they occur. This review loop and quality
assurance processes will be discussed in more depth later.
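As a minimal sketch of that random sampling step, the function below picks
a fraction of annotation IDs for review. The name `sample_for_review`, the
5% fraction, and the integer IDs are assumptions for illustration; a real
review loop would also need assignment and tracking.

```python
import random


def sample_for_review(annotation_ids, fraction=0.05, seed=None):
    """Randomly pick a fraction of annotations to send to a reviewer."""
    ids = list(annotation_ids)
    if not ids:
        return []
    # Always review at least one item, even for very small batches.
    k = max(1, round(len(ids) * fraction))
    return random.Random(seed).sample(ids, k)


review_queue = sample_for_review(range(1000), fraction=0.05, seed=42)
print(len(review_queue))
```

Passing a seed makes the selection repeatable, which can help when a review
batch needs to be reconstructed later.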


If the review method fails, and you think you still need multiple people, you
probably have a bad Product Data Fit and need to change the Schema or
raw data collection to fix it.


Zooming out from Schema, Raw Data, and Annotation, the other big
aspects of quality are maintenance of the data and the integration points with
ML programs. Quality includes cost considerations, expected use, and
expected failure rates.


To recap here, quality is first and foremost formed by the Schema and Raw
Data, then by the Annotators and associated processes, and rounded out by
the maintenance and integration.


Integrations 


Much time and energy is focused on “training a model”. However, this
often misses the point, because training a model is a primarily technical,
data science focused concept.


What about maintenance of the training data? What about ML programs
that output useful training data results, such as sampling, finding errors,
or reducing workload, that have nothing to do with training a model? How
about the integration with the application in which the results of the model
or ML subprogram will be used? What about tech that tests and monitors
datasets? The hardware? Human notifications? How is the technology
packaged into other tech?


Training a model is just one component. To successfully build an ML
program, a data driven program, we need to think about how all the
technology components work together. And to avoid reinventing the wheel
we need to be aware of the growing Training Data ecosystem.


A few key aspects to remember about working with integrations: 


• The training data is only useful if it can be consumed by something,
usually within a larger program.
• Integration with data science is multi-faceted; it’s not just about some
final “output” of annotations. It’s about the ongoing human control,
maintenance, Schema, validation, lifecycle, security, etc. A batch of
outputted annotations is like the result of a single SQL query: a
single, limited view into a complex database.
• Getting a model trained is only a small part of the overall ecosystem.


The human role 


Training Data involves humans, the subject matter experts and
annotators, directly in the process. Humans are programming, creating, and
controlling through data.


Humans exert control on data programs by controlling the Training Data.
This includes controlling the aspects we have discussed so far: the Schema,
Raw Data, Quality, and integrations with other systems. And of course, it
includes Annotation itself: the humans looking at each individual sample.


This control is exercised at many stages, and by many people: from creating
the initial training data to human evaluations that validate data science
outputs. This large volume of people involved is very different from
classic ML.


We have new metrics, like how many samples were accepted, how long is
spent on each task, the lifecycle of datasets, the fidelity of raw data, and
what the distribution of the Schema looks like. These concepts may overlap
with data science concepts, like class distribution, but are worth thinking of
as separate concepts. For example, model metrics are based on the ground
truth of the Training Data, so if the data is wrong the metrics are wrong. And
as discussed in the Quality section, metrics around something like annotator
agreement miss the larger points of Schema and Raw Data issues.
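A few of these metrics can be computed directly from task records. The
sketch below uses a made-up list of tasks with hypothetical field names
(`label`, `accepted`, `seconds`); it only illustrates the kind of numbers
involved, not how any specific tool reports them.

```python
from collections import Counter
from statistics import mean

# Made-up task records; the field names are illustrative assumptions.
tasks = [
    {"label": "good", "accepted": True, "seconds": 12.0},
    {"label": "bad", "accepted": True, "seconds": 20.0},
    {"label": "good", "accepted": False, "seconds": 43.0},
    {"label": "good", "accepted": True, "seconds": 9.0},
]

# Share of samples accepted in review.
accepted_rate = sum(t["accepted"] for t in tasks) / len(tasks)

# Average time spent per task.
mean_seconds = mean(t["seconds"] for t in tasks)

# How the annotations are distributed across the Schema's labels.
label_distribution = Counter(t["label"] for t in tasks)

print(accepted_rate, mean_seconds, dict(label_distribution))
```

Numbers like these describe the annotation process. They say nothing about
whether the Schema or the raw data fit the problem, which still takes
human review.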


Human oversight is about so much more than just quantitative metrics. It’s
about qualitative understanding. Human observation, human understanding
of the Schema, raw data, individual samples, etc. is of great importance.
This qualitative view extends into business and use case concepts. Further,
these validations and controls quickly extend from being easily defined to
more of an art form, acts of creation. Not to mention political and social
expectations around system performance and output.


Working with training data is an opportunity to create. To capture human
intelligence and insights in novel ways. To frame problems in a new
Training Data context. To create new Schema, new raw data capture, and
other Training Data specific methods. Subject matter experts can directly
create new data by annotating.


This creation, this control, it’s all new. While we have established patterns
for various types of human-computer interaction, there is much less
established for human-ML program interactions: for human supervision of a
data driven system, where the humans can directly correct and program the
data.


For example, we expect an average office worker to know how to use word
processing, but we don’t expect them to use video editing tools. Training
data requires subject matter experts. So, in the same way a doctor must
know how to use a computer for common tasks today, they must now learn
how to use standard annotation patterns. As human controlled, data driven
programs emerge and become more common, these interactions will
continue to increase in importance and variety.


Training Data Opportunities 


Now that we understand many of the fundamentals, let’s frame some
opportunities. If you’re considering adding Training Data to your ML/AI
program, some questions you may want to ask are:


• What are the best practices?
• Are we doing this the “right” way?
• How can my team work more efficiently with Training Data?
• What business opportunities can Training Data centric projects unlock?
• Can I turn an existing work process, like an existing quality assurance
pipeline, into training data? What if all of my training data could be in
one place instead of shuffling data from A to B to C? How can I be more
proficient with Training Data tools?


Broadly, a business can: 


• Increase Revenue by shipping new AI/ML Data products.
• Maintain Existing Revenue by improving performance of an existing
product through AI/ML Data.
• Reduce Security Risks by reducing risks and costs from AI/ML data
exposure and loss.
• Improve Productivity by moving employee work further up the
automation food chain. For example, by continuously learning from data,
creating your AI/ML Data engine.


All of these elements can lead to transformations through an organization,
which I’ll cover next.


Business Transformation 


Your team and company’s mindset around training data is important. I’ll
provide more detail in the Transformation Chapter, but for now, here are
some important ways to start thinking about this.


• Start viewing all existing routine work at the company as an opportunity
for Training Data
• Realize that work not captured in a Training Data system is lost
• Begin shifting annotation to be part of every frontline worker’s day
• Define your organizational leadership structures to better support
Training Data efforts
• Manage your training data processes at scale. What works for an
individual data scientist is very different from a team, and different still
from a corporation with multiple teams.


In order to accomplish all of this, it’s important to implement strong
Training Data practices within your team and organization. To do this, you
need to create a Training Data centric mindset at your company. This can be
complex and may take time, but it’s worth the investment now.


To do this, involve subject matter experts in your project planning
discussions. They’ll bring valuable insights that will save your team time
downstream. It’s also important to use tools to maintain abstractions and
integrations for raw data collection, ingest, and egress. You’ll need new
libraries for specific training data purposes so you can avoid reinventing the
wheel. Having the proper tools and systems in place will help your team
perform with a data centric mindset. And finally, make sure you and your
teams are reporting and describing training data. Understanding what was
done, why it was done, and what the outcomes were will inform future
projects.


All of this may sound daunting now, so let’s break things down a step
further. When you first get started with training data, you’ll be learning new
training data specific concepts that will lead to mindset shifts. For example,
adding new data and annotations will become part of your routine
workflows. You’ll be more informed as you get initial datasets, schema, and
other configurations set up. This book will help you become more familiar
with new tools, new APIs, new SDKs, and more, enabling you to integrate
training data tools into your workflow.


Training Data Efficiency 


Efficiency in training data is a function of many parts. We’ll explore this in
greater detail in the chapters to come, but for now, consider these questions:


• How can we create and maintain better Schemas?
• How can we better capture and maintain Raw Data?
• How can we annotate more efficiently?
• How can we reduce the relevant sample counts so there is less to
annotate in the first place?
• How can we get people up to speed on new tools?
• How can we make this work with our application? What are the
integration points?


As with most processes, there are a lot of areas in which to improve
efficiency, and this book will show you how sound Training Data practices
can help.


Tooling Proficiency 


New tools, like Diffgram, now offer many ways to help realize your
Training Data goals. As these tools grow in complexity, being able to
master them becomes more important. You may have picked up this book
looking for a broad overview, or to optimize specific pain points. The
tooling chapter will dive into those concerns.


Common Pain Points

A few common challenges:

• Annotation quality is poor, too costly, too manual, too error prone.
• Duplicate work
• Subject matter expert labor cost too high
• Too much routine or tedious work
• It is near impossible to get enough of the original raw data
• Raw data volume clearly exceeds any reasonable ability to manually
look at it.


What are you trying to achieve with Training Data? 


Why Training Data Matters 


In this section, I’ll cover why Training Data is important for your
organization, and why a strong training data practice is essential. These are
central themes throughout the book, and you’ll see them come up again in
the future.


First, Training Data determines what your AI program, your system, can do.
Without Training Data there is no system. With Training Data, the
opportunities are only bounded by your imagination. Anything that you can
form into a Schema and record raw data for, the system can repeat. It can
learn anything. Meaning the intelligence and ability of the system depends
on the quality of the Schema, and the volume and variety of data you can
teach it.


Second, Training Data work is upstream, before Data Science work. This
means Data Science is dependent on Training Data. Errors in Training Data
flow down to Data Science. Or to use the cliché: garbage in, garbage out.
Figure 1-2 walks through what this data flow looks like in practice.


Figure 1-2. Conceptual position of training data and data science (figure
text: “Strong Training Data leads to strong AI”; “Data Science”)


Third, the Art of Training Data represents a shift in thinking about how to
build AI systems. Instead of focusing only on improving mathematical
algorithms, we optimize the Training Data in parallel to better match our
needs. This is the heart of the AI Transformation taking place and the core
of modern automation. For the first time, knowledge work is being
automated.


ML Applications are Becoming Mainstream 
This commercialization goes beyond headlines of AI research results. In the
last few years, we have seen the demands placed on technology increase
dramatically. We expect to be able to speak to software and be understood,
to automatically get good recommendations and personalized content. Big
tech companies, startups, and businesses alike are increasingly turning to AI
to solve this explosion in use case combinations.


AI knowledge, tooling, and best practices are rapidly expanding. What used
to be the exclusive domain of a few is now becoming common knowledge and
pre-built API calls. We are at the transition phase, going from R&D demos
to the early stage of real-world industry use cases.


Expectations around automation are being redefined. Cruise control to a
new car buyer has gone from just “maintain constant speed” to include
“lane keeping, distance pacing, and more”. These are not future
considerations. These are current consumer and business expectations. They
represent a clear and present need to have an AI strategy and to have ML and
training data competency in your company.


The Foundation of Successful AI 


Machine Learning is about learning from data. Historically, this meant
creating datasets in the form of logs, or similar tabular data such as
“Anthony viewed a video.”


These systems continue to have significant value. However, they have some
limits. They won’t help us do things modern training data powered AI can
do, like build systems to understand a CT scan or other medical imaging,
understand football tactics, or in the future operate a vehicle.


The idea behind this new type of AI is a human expressly saying, “Here’s an
example of what a player passing a ball looks like,” “Here’s what a tumor
looks like,” or “This section of the apple is rotten.”


This form of expression is similar to how in a classroom a teacher explains
concepts to students: by words and examples. Teachers help fill the gap
between the textbooks and the multidimensional understanding that a
student builds over time. In Training Data, the annotator acts as the teacher,
filling the gap between the Schema and the raw data.


Dataset (Definition) 

A dataset is like a folder. It usually has the special meaning that there are
both “raw” data (such as images) and annotations in the same place. For
example, a folder of 100 images plus a text file that lists the annotations.
In practice a dataset is dynamic and stored in part in a database form.
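The “folder plus a text file” picture can be sketched as follows. The file
name `annotations.csv`, its two columns, and the `load_dataset` helper are
assumptions made for this illustration; real datasets vary widely in layout
and, as noted, are often backed by a database.

```python
import csv
import tempfile
from pathlib import Path


def load_dataset(folder):
    """Pair each raw file listed in annotations.csv with its label."""
    folder = Path(folder)
    with open(folder / "annotations.csv", newline="") as f:
        rows = list(csv.DictReader(f))
    return [(folder / row["file"], row["label"]) for row in rows]


# Build a tiny stand-in dataset in a temporary folder so the sketch runs.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("0001.jpg", "0002.jpg"):
        (root / name).write_bytes(b"")   # empty placeholder for a real image
    (root / "annotations.csv").write_text(
        "file,label\n0001.jpg,good\n0002.jpg,bad\n"
    )
    dataset = load_dataset(root)

print([(path.name, label) for path, label in dataset])
```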


Training Data is Here to Stay 


Training Data is going to be here for a very long time. Conceptually, I think
Training Data concepts will still exist decades, or even a century or more,
from now. How can I say that with confidence? Let’s think about the trends.


As mentioned earlier, use cases for modern AI/ML data are transitioning
from R&D to industry. We are at the very start of a long curve on that
business cycle. Naturally the specifics shift quickly. However, the
conceptual ideas around thinking of day-to-day work as annotation,
encouraging people to strive more and more for unique work, and oversight
of increasingly capable ML programs, are all here to stay.


On the research side, algorithms and ideas on how to use Training Data
both keep improving. For example, the trend is for certain types of models
to require less and less data to be effective. The fewer samples a model needs
to learn, the more weight is put on creating Training Data with greater
breadth and depth. And on the other side of the coin, many industry use
cases often require even greater amounts of data to reach business goals. In
that business context, the need for more and more people to be involved in
training data puts further pressure on tooling.


In other words, the expansion directions of research and industry put more
and more importance on Training Data over time.


Training Data Controls the ML Program 


The question in any system is control. Where is the control? In normal
computer code, this is human-written logic in the form of loops, if
statements, etc. This logic defines the system.


In classic Machine Learning, the first steps include defining features of
interest and a dataset. Then an algorithm generates a model. While it may
appear that the algorithm is in control, the real control is exercised by
choosing the features and data. The algorithm’s degrees of freedom are then
controlled by the features and the data.


In a Deep Learning system, the algorithm does its own Feature Selection.
The algorithm attempts to determine what features are relevant to a given
goal. That goal is defined by Training Data itself. In fact, Training Data is
the entire definition of the goal.


Here’s how it works. An internal part of the algorithm, called a loss
function, describes a key part of how the algorithm can learn a good
representation of this goal. The algorithm uses the loss function to
determine how close it is to the goal defined in the training data.


More technically, the loss is the error we want to minimize during training. 
The loss function doesn’t work without having some exterior defined goal 

(such as the training data). In a sense, this is a “goal within a goal”. It’s the 
Loss function’s goal to optimize the Loss, but it can only do that by having 


some reference point, which is defined by the training data. 
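To make the "reference point" idea concrete, here is a minimal sketch. It assumes a binary classification task and uses binary cross-entropy as the loss; the predictions and labels are made-up values, not from any real system. The same model outputs score very differently depending on which annotations they are compared against.

```python
import math

def binary_cross_entropy(predictions, labels):
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    eps = 1e-12  # guard against log(0)
    total = 0.0
    for p, y in zip(predictions, labels):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

# Hypothetical model outputs: probability that each sample is "cancerous".
predictions = [0.9, 0.2, 0.8]

# Two different sets of annotations for the same three samples.
labels_a = [1, 0, 1]
labels_b = [0, 1, 0]

loss_a = binary_cross_entropy(predictions, labels_a)
loss_b = binary_cross_entropy(predictions, labels_b)

# The predictions never changed; only the training data did.
print(f"loss against labels_a: {loss_a:.3f}")
print(f"loss against labels_b: {loss_b:.3f}")
```

The loss function itself is identical in both calls. What counts as "close to the goal" is decided entirely by the labels.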


Therefore, the training data is the “ground truth” for correctness of the
model’s relationship to the human defined goal.


To further understand why this is the case, consider that in many use cases
the loss function is closely related to the task. For example, a given object
detection approach usually includes references to a specific loss function.
That said, in the case of “unsupervised” learning the loss function may be
more related to the goal. While this may seem like a contradiction at first
blush, for practical purposes it’s generally not relevant to supervised cases.


New Types of Users 


In traditional software development there is a degree of dependency
between the end user and the engineer. The end user cannot truly say if
the program is “correct,” and neither can the engineer.

It’s hard for an end user to say what they want until a prototype of it has
been built. Therefore, both the end user and engineer are dependent on each
other. This is called a circular dependency. The ability to improve the
software comes from the interplay between both, from being able to iterate
together.


With Training Data, the humans control the meaning of the system when
doing the literal supervision. The Data Scientists control it when working
on the Schema, for example when choosing abstractions such as label
templates.

For example, if I as an annotator were to label a tumor as cancerous when
in fact it’s benign, I would be controlling the output of the system in a
detrimental way. In this context, it’s worth understanding that no
validation can ever eliminate this control 100%. Engineering cannot
control the data system, both because of the volume of data and because of
a lack of subject matter expertise.


To understand why this goes beyond the “garbage in, garbage out” phrase,
consider that in a traditional program, while the end user may not be happy,
the engineer can, through a concept called unit tests, at least guarantee that
the code is “correct”. This doesn’t mean that it gives the output desired by
the end user, just that the code does what the engineer feels it’s
supposed to do.

Writing that style of unit test is impossible in the context of training data,
because the controls available to engineering, such as a validation set, are
still based on the control exercised by the individual AI supervisors.
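A small sketch can illustrate why a validation set cannot play the role of a unit test. Everything here is hypothetical: the labels are invented, and `true_state` stands in for a real-world truth that engineering normally cannot observe. Because the validation labels come from the same annotator, a systematic labeling error is invisible to the validation score.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the given labels."""
    matches = sum(1 for p, y in zip(predictions, labels) if p == y)
    return matches / len(labels)

# Hypothetical real-world truth, which engineering never sees directly.
true_state = ["benign", "benign", "cancerous", "benign"]

# The annotator mislabels the first sample. Training and validation
# labels both inherit that mistake.
annotator_labels = ["cancerous", "benign", "cancerous", "benign"]

# A model that has learned to reproduce the annotator's labels exactly.
model_predictions = list(annotator_labels)

validation_accuracy = accuracy(model_predictions, annotator_labels)
real_accuracy = accuracy(model_predictions, true_state)

print(validation_accuracy)  # measured against the annotator
print(real_accuracy)        # measured against reality
```

The validation check passes perfectly while the model is wrong on one sample in four. The check can only be as correct as the supervision it is built on.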


Further, the AI supervisors are generally bound by the control exerted by
engineering in defining the abstractions they are allowed to use. The end
user is coding through data. In a sense the end user is woven into the fabric
of the system itself.

This blurring of the lines between “content” and “system” is important.
This is distinctly different from classic systems. For example, on a social
media platform, your content may be the value, but it’s still clear what is the
literal system (the box you type in, the results you see, etc.) and the content
you post (text, pictures, etc.).

Examples of control include:

• Abstractions, like the Schema, define one level of control.
• Annotation, literally looking at samples, defines another level of control.

While Data Science may control the algorithms, the controls of Training
Data often act in an “oversight” capacity, above the algorithm.


Training Data in the Wild 


So far, we’ve covered a lot of concepts and theory, but training data in
practice can be a complex and challenging thing to do well.


What Makes Training Data Difficult? 


The apparent simplicity of data annotation hides the vast complexity, novel
considerations, new concepts and new forms of art involved. It may appear
that a human selects an appropriate label, the data goes through a machine
process, and voila, we have a solution, right? Well, not quite. Here are a few
common elements that can prove difficult.


• Experts from various fields have to work closely with each other.
  Subject matter experts (SMEs) are working with technical folks in new
  ways and vice versa. These new social interactions introduce new people
  challenges.
• Experts have individual experiences, beliefs, and inherent biases.
• Users are operating novel annotation interfaces with few common
  expectations on what standard design looks like.
• The problem itself may be difficult, with unclear answers or poorly
  defined solutions.
• Even if the knowledge is well formed in a person’s head, and the person
  is familiar with the annotation interface, inputting that knowledge
  accurately can be tedious and time consuming.
• Often there is a voluminous amount of data labeling work with multiple
  datasets to manage and technical challenges around storing, accessing,
  and querying the new forms of data.
• Given that this is a new discipline, there is a lack of organizational
  experience and operational excellence that can only come with time.
• Organizations with a strong classical ML culture may have trouble
  adapting to this fundamentally different, yet operationally critical, area.
  The blind spot is thinking they have already understood and
  implemented ML when in fact it’s a totally different form.
• There is a lack of awareness of, access to, or familiarity with the right
  training data tools.
• As a new art form, general ideas and concepts are not well known.
• Schemas may be complex with thousands of elements including nested
  conditional structures.
• Media formats impose challenges like series, relationships, and 3D
  navigation.
• Most automation tools introduce new challenges and difficulties.

And that’s the short list.


While the challenges are myriad and at times difficult, we’ll tackle each of
these in this book to provide a roadmap you and your organization can
implement to improve training data.


The Art of Supervising Machines 


Up to this point, we’ve covered some of the basics and a few of the
challenges around training data. Let’s shift gears away from the science for
a moment and focus on the art. The apparent simplicity of annotation hides
the vast volume of work involved. Annotation is to Training Data what
typing is to writing. Simply pressing keys on a keyboard doesn’t provide
value if you don’t have the human element informing the action and
accurately carrying out the task.

Training Data is a new paradigm around which a growing list of mindsets,
theories, research and standards is emerging. This involves technical
representations, people decisions, processes, tooling, system design, and a
variety of new concepts specific to it.

One thing that makes training data so special is that it captures the user’s
knowledge, intent, ideas, and concepts without specifying “how” they arrived
at them. For example, if I label a “bird”, I am not telling the computer what
a bird is, the history of birds, etc., only that it is a bird. This idea of
conveying a high level of intent is different from most classic programming
perspectives. Throughout this book I will come back to this idea of thinking
of training data as a new form of coding.


A New Thing 


Training Data is not Data Science. They have different goals. Training Data
produces structured data. Data Science consumes it. Training Data is
mapping human knowledge from the real world into the computer. Data
Science is mapping that data back to the real world. They are the two
different sides of the coin.

In the same way that a model must be embedded in an application, even one
with minimal input/output, in order to be useful, training data must be
consumed by data science to be useful. The fact that it’s used in this way
should not detract from its differences. There are still mappings of concepts
to a form usable by data science. The point is having clearly defined
abstractions between them, instead of ad-hoc guessing on terms.


It seems more reasonable to think of Training Data as an Art practiced by
all the other professions than to think of Data Science as the
all-encompassing starting point. Given how many subject matter experts and
non-technical people are involved, that rather preposterous alternative
would seem to assume that Data Science towers over all!

While attempting to call anything a new art form is automatically
presumptuous, I take solace in that I am simply labeling something people
are already doing. In fact, things make much more sense when we treat it as
its own art and stop shoehorning it into other existing categories.

I cover this in more detail in Chapter 7, AI Transformation.


Because Training Data is new, the language and definitions remain fluid.
The following terms are all closely related:

• Training Data
• Data Labeling
• Human Computer Supervision
• Annotation
• Data Program

Depending on the context, those terms can map to various definitions:

• The overall Art of Training Data.
• The act of annotating, such as drawing geometries and answering
  Schema questions.
• The definition of what we want to achieve in a machine learning (ML)
  system, the ideal state desired, the control of the ML system, including
  correction of existing systems.
• A system that relies on human controlled data.

For example, I can say Annotation to mean a specific subcomponent of the
overall concept of Training Data. I can also say “to work with Training
Data” to mean the act of annotating. Because this is a novel, developing
area, some people say Data Labeling and mean just the literal basics of
annotation, while others mean the overall concept of Training Data.

The short story here is that it’s not worth getting too hung up on any of
those terms. The context a term is used in is usually needed to understand
its meaning.


Media Types 


Data comes in many media types. Popular media types include Images,
Videos, Text, PDF/Document, HTML, Audio, Timeseries, 3D/DICOM,
Geospatial, Sensor Fusion, and Multimodal. While popular media types are
often the best supported in practice, in theory any media type can be used.
Forms of Annotation include attributes (detailed options), geometries,
relationships and more. We’ll cover all of this in great detail as the book
progresses, but it’s important to note that if a media type exists, someone is
likely attempting to extract data from it.


ML Program Ecosystem 


Training Data interacts with a growing ecosystem of adjacent programs and
concepts. It is common to send data from a Training Data program to an
ML Modeling program, or to install an ML program on a Training Data
platform. Production data, such as predictions, are often sent to a Training
Data program for validation, review, and further control. The linkage
between these various programs continues to expand. Later in this book we
cover some of the technical specifics of ingesting and streaming data.


Data-Centric Machine Learning 


Subject matter experts (SMEs) and data entry folks may end up spending 4-8
hours a day, every day, on training data tasks like annotation. It’s a
time-intensive task, and it may become their primary work. In some cases, 99%
of the overall team’s time is spent on Training Data and 1% on the
modeling process, for example by using an AutoML type solution or having
a large team of SMEs.4

Data-Centric AI is about focusing on Training Data as its own important
thing: creating new data, new Schemas, new raw data capture, and new
annotations by subject matter experts. It means developing programs with
Training Data at the heart, and deeply integrating Training Data into aspects
of your program. There was mobile-first, and now there’s data-first.


In the data-centric mindset, you can:

• Use or add data collection points, such as new sensors, new cameras,
  new ways to capture documents, etc.
• Add new human knowledge.

The rationales behind a data-centric approach are:

• The majority of the work is in the training data, and the data science
  aspect is out of our control.
• There are more degrees of freedom with training data and modeling than
  with algorithm improvements alone.

When I combine this idea of Data-Centric AI with the idea of seeing the
breadth and depth of Training Data as its own art, I start to see the vast
fields of opportunities. What will you build with training data?


Failures 


It’s common for any system to have a variety of bugs and still generally
“work”. Data programs are similar. For example, some classes of failures
are expected, and others are not. Let’s dive in.

Data programs work when their associated sets of assumptions remain true,
for example, assumptions around the Schema and raw data. These
assumptions are often most obvious at creation, but can be changed or
modified as part of a data maintenance cycle.


To dive into a visual example, imagine a parking lot detection system. The
system may have very different views, as shown in Figure 1-3. If we
create a training data set based on a top-down view (left) and then attempt
to use a car level view (right), we will likely get an “unexpected” class of
failure.


Figure 1-3. Comparison of major differences in raw data that would likely lead to an unexpected 
failure 


Why is it a failure? A machine learning system trained only on images from
a top-down view as in the left image has a hard time running in an
environment where the images are from a front view as shown in the right
image. In other words, the system would not understand the concept of a car
and parking lot from a front view if it has never seen such an image during
training.

While this may seem obvious, a very similar issue caused a real world
failure in a US Air Force system, leading them to think their system was
materially better than it actually was.

How can we prevent failures like this? Well, for this specific one, it’s a
clear example of why it’s important that the data we use to train a system
closely matches production data. What about failures that aren’t listed
specifically in a book?


The first step is being aware of training data best practices. Earlier, when
talking about the Human Role, I mentioned how communication with annotators
and subject matter experts is important. Annotators need to be able to flag
issues, such as problems with the alignment between Schema and raw data.
They also need to be able to surface issues outside the scope of the
specified instructions and Schema, e.g., that “common sense” that something
isn’t right.

Admins need to be aware of the concept of creating a novel, well-named
Schema, that the raw data should always be relevant to the Schema, and that
maintenance of the data will be required.

Failure modes are surfaced during development through discussions around
Schema, expected data usage, and discussions with Annotators.


Failure example in a deployed system 


Given how new some of these systems are, it’s likely that we have barely
seen the smallest of the failure cases of training data.


In April 2020, MIT Technology Review reported on a medical AI that Google
had deployed in clinics to screen for diabetic eye disease.5
They trained it on higher quality scans than what was available during
production. So, when people went to actually use it, they often had to retake
the scans to try and meet that expected quality level. And even with this
extra burden of retaking them, the system still rejected about 25%. That
would be like an email service that made you resend every second email
and completely refused to deliver every fourth email.

Of course, there are nuances to that story, but conceptually it shows how
important it is to align the development and production data. What the
system trains on needs to resemble what will actually be used in the field. In
other words, don’t use “lab” level scans for the development set and then
expect a smartphone camera to work well in production. If production will
be using a smartphone camera, then the training data needs to be from it
too.
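One way to make this alignment concrete is to compare where training and production samples come from. The sketch below is illustrative only: the `source` field, the record layout, and the source names are assumptions, not part of any specific tool.

```python
from collections import Counter

def source_share(records):
    """Return the share of records per capture source."""
    counts = Counter(r["source"] for r in records)
    total = sum(counts.values())
    return {source: n / total for source, n in counts.items()}

def unseen_sources(training, production):
    """Capture sources that appear in production but never in training."""
    return set(source_share(production)) - set(source_share(training))

# Hypothetical metadata: training is dominated by lab-quality scans.
training = [{"source": "lab_scanner"}] * 8 + [{"source": "clinic_camera"}] * 2
production = [{"source": "clinic_camera"}] * 5 + [{"source": "smartphone"}] * 5

print(source_share(training))
print(source_share(production))

missing = unseen_sources(training, production)
print("never seen in training:", missing)
```

A check like this does not prove the data is aligned, but it can surface the most obvious gaps, such as a capture device that production relies on and training never saw.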


Failing to Achieve the Desired Bias 


When we think of classic software programs, any given program is
“biased” towards certain states of operation. For example, an application
designed for a smartphone has a certain context, and may be better or worse
than a desktop application at certain things. A spreadsheet app, for
instance, may be better suited for desktop use.

This bias may be intentional. For example, an official money sending
system will need to disallow random edits, whereas a personal note taking
app may want to make it easy to edit. Once a program like that has been
written, it becomes hard to “unbias” it. The edit-focused program was built
assuming the user would generally be allowed to edit things, whereas the
money sending app has many assumptions built around an end user not
being able to “undo” a transaction.


There’s a similar concept in Training Data. Let’s imagine a crop inspection
application. Imagine it’s mostly designed around diseases that affect potato
crops. There are assumptions made regarding everything from the “raw”
data (e.g., that the media is captured at certain heights), to the types of
diseases, to the volume of samples. It’s unlikely it will work well for other
types of crops. Therefore, it’s important to ensure the Schema fits your
desired application goals, the desired bias.

I will cover Bias from many angles and provide practical tips on how to
work with Training Data to achieve your desired Bias.


What Training Data Is Not 


Training Data is not an ML algorithm. It is not tied to a specific machine
learning approach.

Rather, it’s the definition of what we want to achieve. The fundamental
challenge is effectively identifying and mapping the desired human
meaning into a machine-readable form.

The effectiveness of training data depends primarily on how well it relates
to the human defined meaning and how reasonably it represents real model
usage. Practically, choices around Training Data have a huge impact on the
ability to train a model effectively.


Summary 


This chapter has introduced high level ideas around Training Data for
Machine Learning.

Let’s recap why Training Data is important:

• Consumers and businesses have increasing expectations around having
  ML built in, both for existing and new systems, increasing the
  importance of Training Data.
• It serves as the foundation of developing and maintaining modern ML
  programs.
• Training data is an Art and a new paradigm. It’s a set of ideas around
  new, data-driven programs, controlled by humans, separate from classic
  ML, and comprised of new philosophies, concepts, and implementations.
• It forms the foundation of new AI/ML products, maintains revenue
  from existing lines of business by replacing or improving costs through
  AI/ML upgrades, and is a fertile ground for R&D.
• As a technologist or as a subject matter expert, it’s now an important
  skillset to have.


The art of Training Data is distinct from Data Science. It’s the control of the
system and the goal for the system to learn. Training Data is not an algorithm
or a single dataset. It’s a paradigm that spans people from Subject Matter
Experts, to Data Scientists, to Engineering and more. It’s a way to think
about systems that opens up new use cases and opportunities.

Before reading on, I encourage you to feel comfortable with the high-level
ideas introduced earlier:

• Schema, Raw Data, Quality, Integrations, and The Human Role are all
  key concerns.
• Classic Training Data is about discovery while modern Training Data is
  about copying existing knowledge.
• Deep Learning algorithms generate models based on training data.
  Training Data defines the goal and the algorithm defines how to work
  towards this goal.
• Training data that is validated “in a lab” will likely fail in the field. This
  can be avoided by primarily using field data as the starting point, by
  aligning the system design, and by expecting to rapidly update models.
• Training Data is like Code.

I introduced core concepts, such as literal representations, assumptions,
randomness, automation processes, tooling, and more. In the next chapter
we will cover two example tasks, introduce a basic training data loop, and
introduce training data management concepts.


1 Without further deductions outside our scope of concern.

2 Defense Advanced Research Projects Agency (DARPA)

3 There are statistical methods to coordinate experts’ opinions but these are
always “additional”; there still has to be an existing opinion.

4 I’m oversimplifying here. In more detail, the key difference is that while a
data science AutoML training product and hosting may be complex itself,
there are simply fewer people working on it.

5 https://www.technologyreview.com/2020/04/27/1000658/google-medical-ai-accurate-lab-real-life-clinic-covid-diabetes-retina-disease/


Chapter 2. Getting Up and Running 


A NOTE FOR EARLY RELEASE READERS
With Early Release ebooks, you get books in their earliest form—the
author’s raw and unedited content as they write—so you can take advantage
of these technologies long before the official release of these titles.

This will be the 2nd chapter of the final book. Please note that the GitHub
repo will be made active later on.

If you have comments about how we might improve the content and/or
examples in this book, or if you notice missing material within this chapter,
please reach out to the editor at jleonard@oreilly.com.


Introduction 


We have databases to smoothly store data. Web servers to smoothly serve
data. And now Training Data tools to smoothly work with Training Data.

There are established processes and expectations for how databases
integrate with the rest of your application.

What about Training Data? How do you get up and running with training
data? From installation, annotation setup, embedding, end users, workflow,
and more, I will cover all the key considerations.

I say smoothly because I don’t have to use a database. I could write my data
to a file and read from that. Why do I need Postgres? Well, because Postgres
brings a vast variety of features, such as guarantees that my data won’t
easily get corrupted, that data is recoverable, and that data can be queried
efficiently. Training Data tools have evolved in a similar way.


In this chapter I will cover:

• How to get up and running
• Scope of Training Data tools
• What benefits you get from using training data tools
• Trade-offs
• History

Most of this is focused on things that will be relevant to you today. I also
include some brief sections on history and why these tools matter. I will
also cover other common topics like:

• Key conceptual areas of Training Data tools.
• Overview of where Training Data tools fit in your stack.

I will often use the word tool even when it may be a larger system or
platform. By tool, I mean any technology that helps you accomplish your
training data goals. For example, while I consider Diffgram to be a system
and a platform, it’s also a tool.


Getting Up and Running 


The following section is a minimal viable roadmap to get your ML program
up and running. It is divided into sections for convenience. Usually, these
tasks can be given to different people, and many can be done in parallel.
Depending on a number of factors, it may take many months to get fully set
up.

If you are starting from a clean slate, then all of these steps will be
applicable. If your team is already progressing well, then this provides a
checklist to see if your existing processes are comprehensive.

Broadly, the overall getting started tasks include:

• Installation
• Annotation Setup
• End User Setup
• Data Ingestion Setup
• Data Catalog Setup
• Workflow Setup
• Initial Usage
• Optimization

The initial part here is just talking about common pain points and what
needs to be done. Later, under the Considerations section, I will discuss
trade-offs of specific items.

If that seems like a lot, well, that’s the reality of what’s needed to set up
a successful system.

In most of these steps, there is some level of crossover. For example,
nearly any of the steps can be done through UI/SDK/API, but where
appropriate I call attention to common preferences.


Installation 


Your training data installation and configuration is done by a technical
person, or team of people.

High level concerns of installation include:

• Provisioning hardware (cloud or otherwise)
• Doing the initial installation
• Configuring initial Security items, like Identity providers
• Choosing storage options
• Capacity planning
• Maintenance dry-runs like doing updates
• Provisioning initial super users

Most teams shipping complex, revenue-impacting products do their own
installation. This is just a reality of the level of importance of the data, and
its deep connection to end users. Generally, data setup is of a more fluid
nature than the installation of the training data platform itself, so data setup
is treated as its own section. An example getting started annotation screen
is shown in Figure 2-1.


Diffgram is Commercial Open Source and fully featured. You can
download Diffgram from diffgram.com.


Figure 2-1. Example Diffgram dev installation. 


Annotation Setup 
Annotation setup is usually done by an Admin.

• Initial Schema setup
• Initial Human Task setups
• Planning around Structuring Schema in relevant ways. As these tools
  grow there are more and more options and it becomes like database
  design.

Depending on the complexity, Schema may be loaded through an SDK or
API.
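As a rough illustration of what a Schema defined in code could look like, the sketch below builds one label with a nested, conditional attribute and serializes it to JSON. The field names (`labels`, `attributes`, `show_if`, and so on) are invented for this example and are not the format of Diffgram or any other tool; a real SDK or API will have its own structure.

```python
import json

# Hypothetical schema: one label with a nested, conditional attribute.
schema = {
    "labels": [
        {
            "name": "vehicle",
            "geometry": "box",
            "attributes": [
                {
                    "name": "type",
                    "kind": "select",
                    "options": ["car", "truck"],
                },
                {
                    "name": "trailer_attached",
                    "kind": "boolean",
                    # Only ask this question when type == "truck".
                    "show_if": {"attribute": "type", "equals": "truck"},
                },
            ],
        }
    ]
}

def visible_attributes(label, answers):
    """Return the attribute names an annotator should see, given answers so far."""
    visible = []
    for attribute in label["attributes"]:
        condition = attribute.get("show_if")
        if condition is None or answers.get(condition["attribute"]) == condition["equals"]:
            visible.append(attribute["name"])
    return visible

# The string a hypothetical SDK or API call would send.
payload = json.dumps(schema)

vehicle = schema["labels"][0]
print(visible_attributes(vehicle, {"type": "car"}))
print(visible_attributes(vehicle, {"type": "truck"}))
```

Keeping the Schema as data makes it possible to version it, review changes, and load the same definition into more than one environment.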


Annotation is covered in more detail in the following chapters: Schema, 
Chapter 3, Annotation Admin Setup, Chapter 7, and Annotation Concepts, 
Chapter 8. 


End User Setup 
In some way the end user must be able to input data into the system.
This can be done by Embedding data collection into your application,
or through a dedicated Portal.

• Planning. Will you collect data by embedding directly in your
  application? Or will it be a separate stand-alone portal?
• Development team adding the data collection
• Customization of the look and feel of Annotation portals


Embedded 


Your users are often the best people to provide supervision (annotation).
Your users already have the context of what they want. Your users can
provide supervision “for free”, which scales much better than trying to hire
ever-larger central annotation teams. You or your engineering team will
need to add some code to your application to collect data this way.
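The code you add can be quite small. As a sketch under stated assumptions, the example below records a user's correction as an event in an in-memory list. The `FeedbackEvent` fields and the `record_feedback` helper are hypothetical names for illustration; in practice the event would be sent to your training data system through its own SDK or API.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class FeedbackEvent:
    """One piece of end user supervision (hypothetical structure)."""
    sample_id: str
    predicted_label: str
    user_label: str
    user_id: str
    created_at: str

def record_feedback(queue, sample_id, predicted_label, user_label, user_id):
    """Append a user's correction to a queue for later ingestion."""
    event = FeedbackEvent(
        sample_id=sample_id,
        predicted_label=predicted_label,
        user_label=user_label,
        user_id=user_id,
        created_at=datetime.now(timezone.utc).isoformat(),
    )
    queue.append(asdict(event))
    return event

# Example: the user marks a message as spam that the model let through.
pending = []
record_feedback(pending, "email-123", "not_spam", "spam", "user-42")
print(pending[0]["user_label"])
```

Storing the model's prediction next to the user's label keeps the disagreement visible, which is often the most useful signal for later review.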


Portal 


Using a stand-alone portal is often easiest. This means that annotators go
directly to your installation, e.g. at a web address, to do the Annotation. You
can “deep link” to the annotation portal in your application, for example
linking to specific tasks.
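A deep link is just a URL that points at a specific task. The sketch below builds one with the standard library. The base URL, the `task/<id>` path, and the `file` query parameter are all assumptions for illustration; check your own portal for its actual URL structure.

```python
from urllib.parse import urlencode, urljoin

def task_deep_link(base_url, task_id, **params):
    """Build a link to a single annotation task (hypothetical URL layout)."""
    link = urljoin(base_url, f"task/{task_id}")
    if params:
        link += "?" + urlencode(params)
    return link

# Example with a placeholder portal address.
link = task_deep_link("https://annotate.example.com/", 1842, file=77)
print(link)
```

Your application can then show this link next to the relevant record, so a user lands on the right task instead of searching for it in the portal.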


Data Setup 
You must load your raw data in a useful way to the system.

• Using ingestion tools
• Integrating within your custom application
• SDKs/APIs usage

This will be covered in greater depth in the Ingest Chapter.


Workflow Setup 


Your data must be able to connect with your ML programs, either through a
manual process or as part of a named workflow.

• Combining multiple action steps to create surfaced processes
• Automating common steps into pipelines

Workflow is covered in Chapter 5.


Data Catalog Setup 


In some way, you must be able to view the results of Annotation work, and
you must be able to access the data at the set level.

• Query languages. Domain-specific query languages are becoming more
  popular for training data. An alternative is understanding the raw SQL
  structure well enough to query it directly.
• Using training data-specific libraries for specific goals like data
  discovery, filtering, etc.
• Even if there are Workflows set up, the need for data catalog-type steps
  remains.
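To give a feel for set-level access, here is a minimal sketch that uses an in-memory SQLite database. The `annotation` table and its columns are invented for this example; a real training data platform will have its own storage layout and often its own query language.

```python
import sqlite3

# Hypothetical, simplified storage: one row per annotation.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE annotation ("
    "id INTEGER PRIMARY KEY, file_name TEXT, label TEXT, annotator TEXT)"
)
conn.executemany(
    "INSERT INTO annotation (file_name, label, annotator) VALUES (?, ?, ?)",
    [
        ("lot_001.jpg", "car", "alice"),
        ("lot_001.jpg", "car", "alice"),
        ("lot_002.jpg", "truck", "bob"),
        ("lot_003.jpg", "car", "bob"),
    ],
)

# Set-level question: which files contain cars, and how many in each?
rows = conn.execute(
    "SELECT file_name, COUNT(*) FROM annotation WHERE label = ? "
    "GROUP BY file_name ORDER BY file_name",
    ("car",),
).fetchall()

for file_name, count in rows:
    print(file_name, count)
```

The point is that the question is answered where the data lives, without exporting every annotation first.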


Initial Usage 


There is always a period of initial usage. This is where Annotators, your
end users, data scientists, ML programs, etc. all operate on and with the
data.

• Training of users, especially if deployed towards in-house users
• User feedback, especially if deployed towards end users


Optimization 


Once the basics are set up, there are many concepts that can be further
optimized.

• Day-to-day work of optimizing the Schema, raw data, and training data
  itself
• Annotation proficiency, ergonomics, etc.
• Literal annotation proficiency
• Loading data into machine learning tools, libraries, and concepts
• Contributing to open source Training Data projects
• General knowledge of available new tools


Tools Overview 


Training data tools are required to ship your AI/ML program. Key areas
include:

• Annotation
• Catalog
• Workflow


Annotation 


An end user annotates data using annotation tools. The spectrum goes from
the annotation being part of your application to stand-alone pure annotation
tools. In the most abstract terms, this is the “data entry” side.

• Literal annotation UIs for images, video, audio, text, etc.
• Manage Tasks, Quality Assurance and more.


Catalog 


Catalog is the search, storage, exploration, curation, and usage of sets of
data. This is the “data out” side, usually with some level of human
involvement, for example looking at sets of data.

• Ingest: Raw data, prediction data, metadata and more.
• Explore: Everything from filtering uninteresting data to visually viewing
  it.
• Debug: Debugging existing data, such as predictions and human
  annotations.
• Data lifecycle: Retention and deletion, Personally Identifiable
  Information, and Access Controls.


Workflow 


Workflow is the set of processes and data flows of your training data. It’s
the glue between annotation, catalog, and other systems. Think integrations,
installations, plugins, etc. This is both data in and data out, but usually at
more of a “system” level.

• Ecosystem of technologies.
• Annotation Automation: Anything that improves annotation
  performance, such as pre-labeling or active learning. See Chapter 6 for
  more depth.
• Collaboration across teams between machine learning, product, ops,
  managers, and more.
• Stream to Training: Getting the data to your models.
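As one small example of annotation automation, the sketch below shows least-confidence sampling, a common active learning heuristic: send humans the samples the model is least sure about. The prediction format and sample ids are made up for illustration, and real tools may use other selection strategies.

```python
def least_confident(predictions, k):
    """Return the k sample ids whose top predicted probability is lowest."""
    ranked = sorted(predictions.items(), key=lambda item: max(item[1].values()))
    return [sample_id for sample_id, _ in ranked[:k]]

# Hypothetical model output: class probabilities per sample.
predictions = {
    "img_1": {"car": 0.98, "truck": 0.02},
    "img_2": {"car": 0.55, "truck": 0.45},
    "img_3": {"car": 0.70, "truck": 0.30},
}

# Route the two most uncertain samples to human review first.
to_review = least_confident(predictions, k=2)
print(to_review)
```

The confident prediction on `img_1` could be kept as a pre-label, while annotator time goes to the samples where human judgment adds the most.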


Some products cover most of these areas in one platform. 


Training Data for Machine Learning 


Usually machine learning modeling and training data tools are different
systems.

The more ML program support that is natively included, the less flexible
and powerful the overall system is.

As an analogy, in a word processing document I can create a table. This is
distinctly different from the power of formulas that a spreadsheet
application brings.

Usually the best way around this is great integrations, so that systems that
focus on training data quality, ML modeling, etc. can be their own systems,
but still tightly integrated with training data.

This chapter will focus on the major sub-areas of training data specifically,
assuming that the model training is handled by a different system.


Growing Selection of Tools 


There are an increasing number of notable platforms and tools becoming
available. Some aim to provide broad coverage while others cover deep and
specific use cases in each of these areas. There are tens of notable tools that
fall into each of the major categories.

As the demand for these commercial tools continues to grow, I expect that
there will be both a stream of new tools entering the market and a
consolidation in some of the more mature areas. Annotation is one of the
more mature areas. Data exploration in this context is relatively new.

I encourage you to continuously explore the options available that may net
different and improved results for your team and product in the future.


People, Process, and Data 


Employee time in any form is often the greatest cost center. 


Well-deployed tooling brings many unique efficiency improvements, and these can stack to create orders-of-magnitude improvements in performance. To continue the database analogy, think of it as the difference between sequential scans and indexes. One may never complete while the other is fast! Training data tools upgrade you to the world of indexes.


Training data tools allow you to: 


• Embed supervision directly in your application.
• Empower people, processes, and data in the context of human computer supervision.
• Standardize your training data efforts around a common core.
• Surface training data issues.


Training Data tools similarly provide many benefits that go far beyond handling the minutiae. For example, data scientists can query the data created by annotators without having to download huge sets and manually filter them locally. However, for that to work, a training data system must be set up first.


Embedded 


When we get a spam email, we mark it as spam, teaching the system. We can add words to our spellchecker. These are simple examples of end user engagement. Over time, more and more of the supervision will be pushed as near to the end user as possible and embedded in systems.


Best Practices and Levels of Competency 


Becoming highly competent with training data tools will take you at least as much work as it took to learn DevOps.

Your training data learning is an ongoing journey rather than a destination. I bring this up to level set: no matter how familiar you are, or how much time you spend, there is always more to learn about training data - I am always learning myself!


Human Computer Supervision 


You may be familiar with the concept of “Human Computer Interaction” (HCI): how a user relates to and engages with a computer program. With Training Data I introduce a concept called Human Computer Supervision (HCS). The idea is that you are supervising the “computer”.


The “computer” could be a machine learning model, or a larger system. The supervision happens on multiple levels, from annotation to approving datasets.

To contrast these, in HCI the user is primarily the “consumer”, whereas in HCS the user is more the “producer”. The user is producing supervision (on top of interaction) which is consumed by the ML program.


The key contrast here is that computer interaction is usually deterministic: if I update something, I expect it to be updated. Computer supervision, like human supervision, is non-deterministic. There’s a degree of randomness. As a supervisor I can supply corrections, but for each new instance the computer still makes its own predictions. There is also a time element. HCI is usually an “in the moment” thing, whereas HCS operates at longer time scales, where supervision affects a more nebulous, unseen system of machine learning models.


For the sake of space I won’t dwell on this distinction. Over time, the general idea here is to keep differentiating between this new form of human computer supervision work (HCS) and regular computer usage (HCI).


Separation of End Concerns 


Training data tools separate end user data capture concerns from other concerns. For example, you can add end user capture of supervision to your application without worrying about the data flow behind it.
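As a minimal sketch of this separation, the application code below only records what the end user said, and where the record goes is decided by whichever `sink` is plugged in. The `SupervisionEvent` record and the `make_capture` helper are hypothetical names made up for this illustration and do not come from any particular tool.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Callable


@dataclass
class SupervisionEvent:
    """One piece of end user supervision (hypothetical record shape)."""
    item_id: str
    label: str
    user_id: str
    created_at: float


def make_capture(sink: Callable[[str], None]) -> Callable[[str, str, str], SupervisionEvent]:
    """Build a capture function bound to a sink.

    The sink is any callable that accepts a JSON string: a queue producer,
    an HTTP client wrapper, or (as in the usage below) a plain list's append.
    """
    def capture(item_id: str, label: str, user_id: str) -> SupervisionEvent:
        event = SupervisionEvent(item_id, label, user_id, time.time())
        # The capture side serializes and hands off; it knows nothing
        # about storage, versioning, or training.
        sink(json.dumps(asdict(event)))
        return event

    return capture


# Usage: an in-memory list stands in for the real data flow.
outbox = []
capture = make_capture(outbox.append)
capture("email-123", "spam", "user-7")
```

Swapping the list for a real transport changes nothing in the calling code, which is the point of keeping the two concerns apart.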


Standards 


Training Data tools are a means to effectively ship your Machine Learning product. As a means to this complex end, Training Data tools come with some of the most diverse opinions and assumptions of any area of modern software.

Tools help bring some standardization and clarity to the noise. They also help bring teams that may otherwise have no benchmark of comparison rapidly into the light.


• Why not have end users provide some of the supervision?
• Why manually version control when you can do it automatically?
• Why manually export files, when you can stream the data you need?
• Why have different teams storing the same data with slightly different tags, when you can use a single unified data store?
• Why manually assign work when it can be done automatically?


Expansive Tooling 
One way to picture it is that Training Data tooling is:

Photoshop + Word + Premiere Pro + Task Management + Big Data Feature Store + Data Science Tools

• Annotators expect to be able to annotate as well as they could with the best drawing tools.
• Engineers expect to be able to customize it.
• Managers expect modern task management, like a dedicated task management system.
• Data Engineers expect to be able to ingest and process huge amounts of data, effectively an Extract Transform Load tool in its own right.
• Data Scientists expect to be able to conduct analysis on it like a business intelligence tool.

All in one system.


Not having the right training data tools would be like trying to build a car without a factory. Achieving a fully set up training data system can only be accomplished through the use of tools.

We have a tendency to take the familiar for granted. It’s just a car, or just a train, or just a plane. All are engineering marvels in their own right. We similarly discount what we don’t understand. “Sales can’t be that hard,” says the engineer. “If I were President,” etc. I have found many people do not understand the breadth of training data tooling:


• The literal work of annotation and the task management are rolled into one system.
• A suite of interfaces comparable to the Adobe suite or Office is rolled into one system.
• The system’s many distinct functions must work in concert, with input and output directly used in other systems, plus integrated data science tools.


A Paradigm to Deliver Machine Learning Software 


In the same way that the DevOps mindset gives you a paradigm to deliver software, the Training Data mindset gives you a paradigm to ship Machine Learning software. To put it very plainly, these tools provide:

• Basic functions to work with training data, like annotation and datasets. These are things that would otherwise be impractical to do without tooling.
• Guardrails to level set your project. Are you actually following a Training Data mindset?
• The means to achieve training data goals like managing costs, tight iterative time-to-data loops, and more.


Trade-offs 


If you are weighing trade-offs to compare multiple tools for production usage, here are some things to consider.

In the rest of the book we continue to cover abstract concepts. In this section I pause to cover real world industry trade-offs, especially around tooling.
Costs 


Common costs include Embedded Integration, End User Input Development and Customization, Commercial Software Licenses, Hardware, and Support.

Besides commercial costs, all tools have hardware costs to host and store data. Further, some tools charge for use of special tools, compute use, and more.
Common cost reductions: 


• End User Input (Embedded) is materially less expensive than hiring more annotators.
• Push automations onto the frontend as much as possible. This reduces the cost of server side compute.
• Separate true data science training costs from annotation automation.


Common licensing models include unlimited, by user, by cluster, or other more specific metrics.

Some Commercial Open Source products may allow a trial to build out a case for a paid license, or be free for personal or education use.

Most SaaS Training Data services have severe limits on the free tier. And some SaaS services may even have privacy clauses that permit them to use your data to build “mega” models that are to their benefit.


Installed vs Software as a Service 


First, the volume of Training Data is very high compared to other types of software: tens to thousands of times more than many other typical use cases. Second, the data is often of a sensitive nature, such as medical data, IDs, bank documents, etc. Third, because training data is code, and often contains unique IP and Subject Matter expertise, it’s very important to protect it. So to recap, Training Data is:

• High Volume
• Sensitive
• Contains Unique IP

The result of this is that there is a massive difference between a tool you can install on your own hardware and using SaaS.


Given these variables, there is a clear edge to training data products that can be installed on your hardware from day one. Keep in mind that “your hardware” may mean your cluster in a popular cloud provider. As packaging options improve, it becomes easier and easier to get up and running on your own hardware.

This is also another area where open source really shines. While SaaS providers sometimes have more expensive versions that can be locally deployed, the ability to inspect the source code is often still limited (or even zero!). Further, these use cases are often rather rigid: it’s a preset setup and requirement set. Software that’s designed to run anywhere, by contrast, can be more flexible to your specific deployment needs.


Development System 


There is a classic debate of “build or buy”. I think this should really be “Customize, Customize, or Customize?”, because there is no valid reason to start from scratch at this point, with so many great options already available as starting points.

• Some options, like Diffgram, offer full Development Systems that allow you to build your own IP on top of the baseline platform, including embedding annotation collection in your application.
• Some options have increasing degrees of out-of-the-box customization.
• Open Source options can be built on and extended.

For example, maybe your data requirements mean the ingest or database of a certain tool isn’t enough. Or maybe you have a unique UI requirement. The real question is:

• Should we do this ourselves?
• Should we get the vendor to do it for us?


Sequentially dependent discoveries 


A mental picture I like to think about is standing at the base of a large hill or mountain. From the base, I can’t see the next hill. And even from the top of that hill, my vantage is obscured by the next, such that I can’t see the 3rd mountain until I traverse the 2nd, as shown in Fig 2-2. Essentially, subsequent discoveries are dependent on earlier ones.

Figure 2-2. Sight-lines over Hills, I can only See The Next Mountain.


Training Data tools help you smoothly traverse these mountains as you get to them, and in some cases even “see around corners” and give you a bird’s-eye view of the terrain. To make that concrete, I only came to better understand the need for querying data once I realized that, over time, the organizational approaches used to annotate data often don’t align with the needs of a data scientist, especially on larger teams. This means that no matter how good the initial dataset organization is, there is still a need to go back afterwards and explore it.


Training data tools are likely to surprise you with unexpected opportunities to improve your process and ship a better product. They provide a baseline process. They help you avoid thinking you have invented something new, only to realize an off-the-shelf system already does that, and with a bit more finesse. They improve your business key performance indicators by helping you ship faster, with less risk, and with more efficiency.

Of course, these tools are not a cure-all. Nor are they bug free. Like all software, they have their hiccups. In many ways these are the early years of these tools.


Scale 


Disney World operates differently from a local entertainment center like an arcade. What works for Disney doesn’t work for the arcade, and vice versa.


As covered in the Automation chapter, at the extreme end of the scale, a fully set up Training Data system allows you to retrain your models practically on demand. Improving the speed of time-to-data (the time between when data arrives and when a model is deployed) to approach zero can mean the difference between tactical relevancy and worthlessness.

Often the terms we are used to thinking about for scale with regular software aren’t as well defined with supervised Training Data. Here I take a moment to set some expectations around scale.


Why is it useful to define scale? 


First, to understand what stage you are in, which helps inform your research directions. Second, to understand that various tools are built for different levels of scale. A rough analogy may be SQLite vs. PostgreSQL: two different purposes, from the simple to the complex.


A large project will typically have the greatest percentage of annotation from the embedded collection. A super-large project may go through billions of annotations per month. At that point it’s less about training a singular model, and more about routing and customizing a set of models, or even user specific models.

On the other side of the coin, for a small project, a data discovery tool may not be relevant if you plan to use 100% of your data anyway.

For mid scale and up, you may prefer that your team goes through a few hours of training to learn the best practices of more complex tools if they are going to work on it day in and day out.


So what makes this so hard? First, most datasets in the wild don’t really reflect what’s relevant for commercial projects, or are misleading. For example, they may have been collected at a level of expense impractical for a regular commercial dataset. This is part of why a “dataset search” doesn’t really make that much sense: most public datasets aren’t going to be relevant to your use cases, and of course will not be styled to each of your end users. Further, most companies keep the deep technical details about AI projects pretty hush-hush, to a degree noticeably different from more common projects.


All projects still have very real challenges. Table 2-1 segments projects into three buckets, and speaks to common properties of each.


Table 2-1. Data Scale Comparison

| Item | Small | Medium | Large |
|---|---|---|---|
| Embedded or Central | Central Portal | Hybrid (Both Embedded and Central) | Hybrid with >90% of annotations from Embedded |
| Volume of Annotations at Rest. In a given period of time. | Thousands | Millions | Billions |
| Media Types | Singular, e.g. the entire team is focused around visual. | Multiple Media Types, e.g. multiple projects such as visual, text etc. | Multiple Media Types |
| People Annotating (Subject Matter Experts) | A single person or small team | End user and a medium sized team | End users primarily and multiple teams of people |
| People with data engineering, data science, etc. hats | A single person | A team of people | Multiple teams of people |
| Modeling concepts | Singular model | Set of named models. | “Routing” or channeling data automatically to various models. Per user Customization and modeling. |
| Schema Complexity | A few labels or a few attributes. <100 elements | Sets of attributes, may be thousands of elements | As complex as product needs dictate |
| Revenue impacted | No $ amount formally attached or pure research | Millions or $10s of Millions of dollars impacted by work | $100s of Millions or Billions of dollars impacted by work |
| System Load (Queries per Second, QPS) | <1 QPS | <1000 QPS | >1000s of QPS |
| Chief concerns | Ease of getting started and ease of use, Cost of tooling (may be no or low budget for tooling) | Effectiveness of tooling, Support and uptime of tools, Starting to think about optimization, May be planning a transition to major scale | Embedded Data Collection, Volume of data (“Scale”), Customization, Security, Inter-team issues, Assumes each team is already doing optimizations they are familiar with. |


Of course there are many exceptions and nuances, but if you are trying to scope out a project, this may be a good starting point. As you trend towards larger use cases, the following become more important:

• Embedded, end user annotation
• Normal system scaling concerns
• Engaging with multiple teams in a standardized way, such as having a similar process for multiple data types
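The buckets in Table 2-1 can be turned into a rough triage helper. The sketch below is only an illustration: the `ProjectProfile` fields and the exact cut-offs (one million and one billion annotations, 1 and 1,000 QPS) are assumptions read loosely off the table, not hard boundaries.

```python
from dataclasses import dataclass


@dataclass
class ProjectProfile:
    """Two of the table's dimensions (field names are hypothetical)."""
    annotations_at_rest: int  # total stored annotations
    peak_qps: float           # queries per second against the system


def scale_bucket(profile: ProjectProfile) -> str:
    """Return 'small', 'medium', or 'large'.

    A project is placed in the largest bucket that either dimension reaches.
    The thresholds are assumed cut-offs, chosen to match the table's
    "Thousands / Millions / Billions" and "<1 / <1000 / >1000s QPS" rows.
    """
    if profile.annotations_at_rest >= 1_000_000_000 or profile.peak_qps >= 1000:
        return "large"
    if profile.annotations_at_rest >= 1_000_000 or profile.peak_qps >= 1:
        return "medium"
    return "small"


print(scale_bucket(ProjectProfile(annotations_at_rest=5_000, peak_qps=0.2)))
print(scale_bucket(ProjectProfile(annotations_at_rest=3_000_000, peak_qps=50)))
```

Treat the result as a conversation starter for scoping. The other rows of the table (people, schema complexity, revenue) matter just as much and are not modeled here.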


Transitioning from small to medium scale 


This also applies to planning a mid scale system from scratch. A few things to think about:

• Workflow
• Integrations
• Use of more data exploration tooling

Also see the large scale section that follows for further directional planning. Not all of those concerns will apply or be actionable yet, but it’s good to be aware of them.


Large scale thoughts 


If you plan to be operating at large scale, here are a few thoughts to keep in mind.

How can we push more of the supervision to the end user? A central team without an Embedded plan is a variable cost that increases relative to the scope of the project, quality expectations, and users. Whereas if the focus of ongoing maintenance is on embedding it with the end user, the central team can be thought of as a fixed startup and quality assurance cost.


What’s the velocity at which data moves through the system? How long does it take to go from new data, to upgraded supervised data, to a new model? This is similar to DevOps velocity.
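One way to make this velocity measurable is to timestamp each stage and look at the gaps. The following is a minimal sketch, assuming three hypothetical stage names (`raw_data_arrived`, `supervision_complete`, `model_deployed`) that you would replace with whatever events your own pipeline records.

```python
from datetime import datetime, timedelta
from typing import Dict

# Hypothetical stage names, in the order data moves through the system.
STAGES = ("raw_data_arrived", "supervision_complete", "model_deployed")


def stage_durations(timestamps: Dict[str, datetime]) -> Dict[str, timedelta]:
    """Time spent between each pair of consecutive stages."""
    durations = {}
    for earlier, later in zip(STAGES, STAGES[1:]):
        delta = timestamps[later] - timestamps[earlier]
        if delta < timedelta(0):
            raise ValueError(f"{later} is recorded before {earlier}")
        durations[f"{earlier} -> {later}"] = delta
    return durations


def time_to_data(timestamps: Dict[str, datetime]) -> timedelta:
    """Total time from data arriving to a model being deployed."""
    return sum(stage_durations(timestamps).values(), timedelta(0))


example = {
    "raw_data_arrived": datetime(2024, 1, 1, 9, 0),
    "supervision_complete": datetime(2024, 1, 2, 9, 0),
    "model_deployed": datetime(2024, 1, 4, 9, 0),
}
for stage, spent in stage_durations(example).items():
    print(stage, spent)
print("time-to-data:", time_to_data(example))
```

Breaking the total into stages shows where the time actually goes, which is usually more actionable than the single headline number.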


Over the last few years the commercial tooling market has changed dramatically. What was completely unavailable years ago may now be an off-the-shelf option. It’s a great time to rethink what unique value each team is adding. Can you more readily customize an ongoing project than do all the plumbing yourself? Do we really need to build this in-house?

Rethinking alignment with open source standards is especially relevant for large teams that may have formed well before those standards were available.


Do we really have to duplicate this data? Is there a more central way we can store this data as it moves around these stages? If you draw out the various parts of your ML and training data, you may be very surprised how many times data is unnecessarily transmitted (e.g. via events) or even duplicated at rest.


Do you have an existing classic (discovery focused) training data system? Are the concepts and awareness there relevant to this new form of human supervision? Speaking very naively, you can think of a given data store as a base layer, upon which data forks into supervised cases and discovery cases. So if any plan you made for your overall ML architecture years ago didn’t take supervised annotation into account, it needs to be completely reworked. Supervised is so different that trying to shoehorn it into existing architecture will not work well. Instead, the team needs to create a new ML architecture plan that gives supervised learning its own architecture.


How many people does it take to surface issues and make a correction to a model? For example, a worst practice would be something like: a manual feedback process, then a central human annotator, then an ML engineer, then a manager. In that process, it could take months to ship the next iteration of a model. (Imagine if a central team had to be called in every time a user wanted to add a word to their spellcheck dictionary.) A best practice would be embedded data collection, automatic retraining, automatic validation, and spot checks by a central QA team.


What is your policy for sharing end user supervision data? Is there a declared process for when and how shared datasets go into production? For example, this could be dedicated logic in your application, where your end users, super users, etc. have control. Or for system wide data it may be similar to approvals for pull requests. You may already have model deployment flows. It’s also important to look at the actual data the model is being trained on.


How do you assess the ROI of data, especially of incremental additions or deletions? At larger scales it becomes more practical and useful to consider the ROI of new data relative to business goals. Examples: What percent of stored data are we actually using? How much revenue is affected by this dataset?
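The first of those example questions reduces to simple set arithmetic once you can list what is stored and what a training run actually consumed. A minimal sketch, assuming each item has a stable string ID (the function name and the ID scheme are made up for illustration):

```python
from typing import Set


def percent_of_stored_data_used(stored_ids: Set[str], used_ids: Set[str]) -> float:
    """Percentage of stored items that appear in at least one training set.

    IDs in used_ids that are not in stored_ids (for example, items that
    have since been deleted) are ignored.
    """
    if not stored_ids:
        return 0.0
    return 100.0 * len(stored_ids & used_ids) / len(stored_ids)


stored = {"img-1", "img-2", "img-3", "img-4"}
used_in_training = {"img-1", "img-2", "img-9"}
print(percent_of_stored_data_used(stored, used_in_training))
```

The revenue question is harder, since it needs a mapping from datasets to the products they feed, and that mapping lives outside the training data system.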


What does the shape of the data really look like? For example, if you are doing a request/response cycle for every single image, audio file, etc., does that really make sense? Instead, can the data be queried in a central place and then streamed?
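To illustrate the difference in shape, the sketch below pages through a query in batches instead of issuing one request per item. It assumes a hypothetical `fetch_page(offset, limit)` callable standing in for whatever query endpoint your system exposes, and an in-memory list plays that role here.

```python
from typing import Callable, Iterator, List


def stream_in_batches(
    fetch_page: Callable[[int, int], List[dict]],
    batch_size: int = 100,
) -> Iterator[List[dict]]:
    """Yield batches of records until the source returns an empty page."""
    offset = 0
    while True:
        page = fetch_page(offset, batch_size)
        if not page:
            return
        yield page
        offset += len(page)


# Stand-in data source: 250 records held in memory.
records = [{"id": i} for i in range(250)]
calls = []


def fake_fetch(offset: int, limit: int) -> List[dict]:
    calls.append((offset, limit))  # track how many round trips were made
    return records[offset:offset + limit]


total = sum(len(batch) for batch in stream_in_batches(fake_fetch, batch_size=100))
print(total, "records in", len(calls), "round trips")
```

With one request per item the same data would take 250 round trips; batching needs four here, counting the final empty page that signals the end.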


Are my data governance policies actually getting implemented across teams? Are datasets stored with the same awareness of expiration controls as individual elements? Is there alignment between teams, or configuration of tools, for this?


Installation Options 


Choosing a package, storage, and database are key parts of installation configuration.
Packaging 


Training Data tools come in a variety of packages. The way the code is packaged is sometimes an indication of its target audience, such as small, medium, or large projects.
Single language specific package 


Some tools install as a single package, for example a Python package. Usually these are single point “plugins” and not complete systems. These tools usually only work for small projects.


Docker 


Many tools will require a Docker container, multiple containers, or similar. Docker is a way to package software. Docker Compose is a way to group multiple packages. In theory, any time the Docker images are provided you can manage those images as you see fit.


Kubernetes (K8s) 


K8s orchestrates containers. This is the default recommendation for production, although there are many other options. The major cloud providers have noticeably different implementations of Kubernetes. Specifically, what may take hours of work on one platform is sometimes much easier on others. Training data often represents an unparalleled volume of data, so expectations around data access, storage, and usage are new, and often don’t align with many pre-optimized cloud examples.


Storage 


Where in the world is the system going to be deployed? If you have users in another country, how does that impact your performance and security goals? If a cloud storage option is not available, what types of local options will meet your needs?


Database 


Diffgram uses PostgreSQL by default. Many other databases are available.

Usually there are at least three different groups of people involved in setting up and using the system:

• Admins
• Technical (Engineering, Data Science)
• Annotators


Data configuration 


There are various configurations to be aware of for specific media types. For example, for videos, whether or not to store individual frames on demand. More generally, there may be choices about which “artifacts” are stored, such as thumbnails or web optimized versions.
Versioning resolution 


How many versions of previous annotations - potentially all of them - are required? Should every change be recorded?

In some systems this may be critical; in others, simply a useful feature. As a rule of thumb, turning on complete versioning will likely result in at least 80% of the database being composed of these soft deleted annotations.
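To see why soft deleted rows come to dominate, consider a minimal in-memory sketch of complete versioning. The `VersionedStore` class and its methods are hypothetical and only illustrate the idea: every edit appends a new row and marks the previous one as soft deleted rather than removing it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AnnotationVersion:
    annotation_id: str
    version: int
    label: str
    soft_deleted: bool = False


class VersionedStore:
    """Keeps every version of every annotation (illustrative only)."""

    def __init__(self) -> None:
        self._rows: List[AnnotationVersion] = []

    def current(self, annotation_id: str) -> Optional[AnnotationVersion]:
        """Latest version that has not been superseded, if any."""
        for row in reversed(self._rows):
            if row.annotation_id == annotation_id and not row.soft_deleted:
                return row
        return None

    def save(self, annotation_id: str, label: str) -> AnnotationVersion:
        """Record a change: soft delete the old row, append the new one."""
        previous = self.current(annotation_id)
        next_version = 1
        if previous is not None:
            previous.soft_deleted = True
            next_version = previous.version + 1
        row = AnnotationVersion(annotation_id, next_version, label)
        self._rows.append(row)
        return row

    def soft_deleted_fraction(self) -> float:
        """Share of stored rows that are superseded versions."""
        if not self._rows:
            return 0.0
        return sum(r.soft_deleted for r in self._rows) / len(self._rows)


store = VersionedStore()
for label in ["cat", "dog", "fox", "owl", "bat"]:
    store.save("a1", label)
print(store.current("a1").label, store.soft_deleted_fraction())
```

In this toy run, an annotation edited five times leaves four superseded rows for one live row. Real systems differ in the details, but the storage cost of recording every change grows with the number of edits, not the number of annotations.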


Data lifecycle 


Do you have to delete the data within a certain period of time? Or must you retain it for a certain length of time? Can some be automatically archived after a period of time?
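These three questions map naturally onto a small policy object. The sketch below is one possible shape, with hypothetical field names (`archive_after`, `delete_after`, `retain_at_least`); it is not taken from any specific tool, and real policies usually carry legal nuances that a few timedeltas cannot express.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class LifecyclePolicy:
    archive_after: Optional[timedelta] = None    # move to cold storage after this age
    delete_after: Optional[timedelta] = None     # must be gone after this age
    retain_at_least: Optional[timedelta] = None  # must be kept at least this long

    def __post_init__(self) -> None:
        # A policy that demands deletion before the minimum retention
        # period has passed contradicts itself, so reject it early.
        if (
            self.delete_after is not None
            and self.retain_at_least is not None
            and self.delete_after < self.retain_at_least
        ):
            raise ValueError("delete_after is shorter than retain_at_least")


def lifecycle_action(created_at: datetime, now: datetime, policy: LifecyclePolicy) -> str:
    """Return 'delete', 'archive', or 'keep' for an item of the given age."""
    age = now - created_at
    if policy.delete_after is not None and age >= policy.delete_after:
        return "delete"
    if policy.archive_after is not None and age >= policy.archive_after:
        return "archive"
    return "keep"


policy = LifecyclePolicy(
    archive_after=timedelta(days=30),
    delete_after=timedelta(days=365),
    retain_at_least=timedelta(days=90),
)
now = datetime(2024, 1, 1)
print(lifecycle_action(now - timedelta(days=45), now, policy))
```

Validating the policy when it is created, rather than when it is applied, surfaces conflicting requirements before any data is touched.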


Annotation Interfaces 


Naturally there needs to be an interface that a human uses to instruct and supervise the machine. Both Portal based and Embedded annotation interfaces continue to evolve. In some ways interfaces are trending towards having relatively similar feature sets, but this remains a very opinionated area. In my opinion one of the most important things is how well the interface is embedded and presented to the end user, so that the user can provide meaningful supervision in the most relevant context.


The spectrum of complexity across types of interfaces, together with the variety of not-yet-standard expectations, makes discussions around annotation interfaces challenging. For example, annotations centered around attributes (like a regular form) are generally more straightforward than video and 3D systems. However, “forms” themselves can be quite complex. And even within the realm of video, the expectations between being able to merely play a video as a reference point, annotating specific moments in time, and more complex cases are all different in scope.


Circling back to our project scale concept, for a small scale project, the overall stylistic feel and out-of-the-box feel may matter. With a large scale project, the embedded interface will be customized, and more likely engineered to specific requirements, for example choosing which components show up where, color and style (CSS) themes, etc. So for a large project, the tool’s ability to be customized, engineered, and developed on is important.


Modeling Integration 


Your Training Data system needs to communicate with your ML modeling systems. With Diffgram this is done through Workflows, covered in later chapters. While modeling systems may sometimes present some superficially similar views, such as outputs with bounding boxes, they usually do not support serious training data work. Modeling integration is related to streaming data, but they are different concepts.


Multi-User vs Single User 


Modern systems, like Diffgram, are multi-user by default. A system for a single user (e.g. Sloth) is probably not part of the modern training data paradigm and is likely focused on the pure user interface portion. The primary reasons for needing multi-user are expertise and volume. Systems can be operated for a limited prototype by a single user, or may run on a single local machine for testing purposes. Much of this chapter is focused on multi-user and team based systems.


Integrations 
There are a few broad categories of integrations:

• System level, one time setup
• Plugins
• Installable things that run on frameworks or on your Training Data platform’s hardware


Parts of Training Data tooling are buried in the technical stack while others are surfaced to end users.

The most basic concept is that you must be able to get raw data and predictions in and annotations out. Considerations include:


• Hardware. Will it run in my environment? Will it work with my storage provider?
• Software Infrastructure. Can I use the training systems, analytics, databases, etc. I want with it?
• Applications and Services. How well does it integrate with my systems? Backend and frontend?
• Plugins
• What types of custom integrations through APIs and SDKs are available?
• How do I get the data back and forth to the training data system?


Some systems offer greater degrees of integration, such as UI based interactions with integration processes, not just for setting up keys, but also for pulling and pushing the data.


Scope 


As this ecosystem continues to evolve, there is a widening range in the scope of users and data that the tooling is designed to work with. Some tools may cover multiple scopes. In general, tools lean either towards single users or towards truly many users.


As shown in Fig 2-3, one way to think of this is as a continuum with two major poles - point and suite solutions.

Figure 2-3. Point Solution and Suite Continuum


Note: Some of these icons are explained throughout the book. Naturally any system will have some concept of input and output. So when we have an Ingest icon, it’s to indicate something that an entire team would work on at a big company. Further, icons like Secure refer to security products like blurring, PII, etc., not the general concept of security.
Platforms and suites 


For mid-sized and large teams, and companies with multiple teams.

From a very high viewpoint, the main differences in psychology with these systems are:
#1 View training data as a dedicated discipline 


Even if they have other integrated Data Science products, services, etc., they draw a clear line about what’s training data and what’s not.


#2 Offer a suite of media types and lateral supports 


Usually you can tell it’s a broader system because it will cover more - or even all - of the media types. Similarly, for lateral supports like storing, streaming to training, exploring, etc. there will likely be more coverage. Given the vastness of the space I use the word coverage, since even the most advanced and largest platforms have gaps.


#3 Breadth and depth


Further expanding on #2, some solutions may offer great coverage on media types, but only to a relatively superficial depth. As a solution leans more towards this end of the spectrum, its depth of offerings in each category continues to grow.


Customization 


The big product difference here is that often these tools assume they will be customized, either with more built-in customization options through configuration, or with more hooks and endpoints to naturally customize it through code.


Generally, systems designed for large scale share the following traits:

• Customization. Virtually everything is up for user configuration, from how the annotation interface looks to how workflow is structured.
• Installation. It’s assumed that installations will be done by, or at least overseen by, the customer. Who has the encryption keys, where the data is stored at rest, etc. are part of the discussion. Expect a dedicated and clear security discussion.
• Performance expectations and capacity planning are done. Any software, no matter how scalable, still requires more hardware to scale.
• Many users, teams, data types, etc. are expected.
• Don’t offer integrated training. Typically this is because the quality of integrated training is below expectations, and because a dedicated team is already doing the training.


Cautions 


• These systems can be very complex and powerful. They usually take more time to set up, understand, and optimize for your use case.
• Sometimes, head to head in a specific function, the larger system may not fare as well. One reason for this is that what may be a high priority for a point solution to fix may be a much lower priority in the scope of a larger system.
• Large systems, even with potentially stronger quality controls, have more bugs. What may be hard to break in a small system due to its simpler nature may break in a larger system due to the complexity.


Point solutions 


Distinguishing features 


• Often mix Training Data and Data Science features. For example, it may be pitched as “End to End” or “Get a Model Trained Faster”.
• Focus on a single or a small handful of media types.
• For single users or small teams. This usage assumption cascades to features around who creates labels, ease of use, etc.
• Software as a service or deployed locally on one machine.


Usage 


• Most appropriate usage includes experimenting with an end to end demo, or cases where it works well enough and you don’t have the resources to use other options.
• Usually, by their nature of being simpler, these tools are faster to set up and get a “result”. Whether it will be the result you want is often more questionable.
• Having some form of automatic training built in. Automatic training is not an automatic negative. However, mid sized and larger teams usually want more control, so it must be taken with a grain of salt.
• Sometimes point solutions can be of great quality in their specific domain.


Cautions 


• These tools usually limit - either by technical or intentional political
limits - what type of results can be achieved. For example, they may
have a method to train bounding boxes, but not keypoints. Or vice versa.
This extends to media types too: they may have a method for images, but
none for text.

• Usually are not appropriate for better-resourced teams. May be lacking
many of the major feature areas, such as dedicated task workflow functions,
ingest, arbitrary store and query, etc.

• Often are not very expandable or customizable relative to heavier weight
solutions.

• Security and privacy are usually limited. For example, the terms of
service may allow these companies to use the data you create to train
other models, or projects may be public by default if you are not
paying. Ultimately there must be trust in the service provider with
your data.

• While the quality may be high, the need to string together the point
solution with other tools often creates extra work. This is especially
prevalent in a larger firm where the tool may be appropriate for one team
but not another.


Cost considerations 


• These types of tools often have a “long tail” of costs. They may have a
cost per annotation. Or it may be free to train the model, but a cost to
serve it (and no option to download it).


Tools in between 


Generally most tools trend towards one of the ends of the spectrum, either
the smaller as already mentioned or the larger use cases as I will cover in a
moment. There are also some tools that are somewhat in between
those poles.


Generally, the progression to look for is: 


• More awareness of training data as a separate, stand-alone concept.

• More awareness of multiple solution paths. Less “one true path” and
more flexibility.

• Greater percent of landscape coverage. For example, may have more
integrations and flexibility.

• More enterprise friendly concepts. May offer local installation, or
customer controlled installations. More focus on customization and
function over golden path mentalities.

• There may be some contractual guarantees around data added.

• These tools may be able to provide serious results and be appropriate if
your team has outgrown smaller tools but doesn’t yet have resources for
larger tooling.


A Suite is not automatically better. However, it is usually hard for smaller
tools to “step up” to a higher level, whereas most larger systems can be
used only in part, and so they fit this middle path quite well.


Where is the machine learning? 


The best platforms offer a solution in between the two extremes of “brittle
single autoML” and “do nothing”.


Essentially this means focusing on the Human Computer Supervision side:
how to get data to and from machine learning concepts, and how to run your
own models and integrate with other systems such as AutoML, dedicated
training and hosting systems, resource scheduling, etc.


Hidden Assumptions 


Training Data tools bring many benefits and are of critical importance. To
reap the benefits, however, you still need to consider these assumptions.
Some of these are usually True, others usually False.

Before we cover the details on regular considerations, it’s worth being aware
of these assumptions.


True: Meet the team 


End users, admins, dedicated annotators, engineering, product, executive
and more. This is a product that gets touched by many people in the
organization, often with very different goals, concerns and priorities.


True: You have someone technical on your team 


Someone needs to install, set up, and maintain the system. Even for 100%
service based tools with the latest wizards, there’s still an assumption that at
least one person is technical and one person understands training data.


True: You have to do setup

Most of this tooling requires setup. While there are a variety of tools that
carve out a narrowly defined area and reduce the amount of effort required,
that’s not really data programming, but just consuming a narrowly defined
service.


True: You have a budget 


All tools have some form of costs. Modern commercial open source tools
have a licensing cost. All have hardware and setup costs.


True: You have time 


The complexity of some of this tooling is quite astounding. As of 2022,
open source Diffgram has over 1,400 files and 500,000 lines of code.


False: You must use graphics processing units (GPUs)

Training a model often benefits from having a processing accelerator like a
GPU. However, actually using the model in automations does not require a
GPU. Also, training in the context of a limited dataset does not benefit as
much from GPU power because it’s a smaller set.


False: You must use automations 


Automations are sometimes useful. They are not required. Improper use can
have negative results with bad feedback loops.


Security 


According to a 2022 Linux Foundation report, “Security is the #1 priority
that influences what software an organization will use. License compliance
is the #2 priority.”1

Security architecture


For high-security installations it is usually best to host your own training
data tools. This allows you complete control to set your own security
practices. You can control the encryption access keys and the location of all
aspects of the system, from network to data at rest.


Attack surface 


Installation is the starting point, since networking is cybersecurity 101.
The attack surface of an inaccessible network is low. So, for example, if you
already have a hardened cluster, you can install your software and use it
within that network.


Security configuration 


Your security posture is dependent on your configuration. Examples include
storing objects by reference vs. ingesting them directly into a defined
bucket, using OIDC or not, and the specific implementation of BLOB signing.
You can configure each of these as you see fit.
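To make the BLOB signing idea more concrete, here is a minimal sketch of a time-limited signed URL built with an HMAC. It is an illustration only: the key, the URL layout, and the function names (`sign_blob_url`, `verify_blob_url`) are assumptions, and real object stores and training data tools each have their own signing schemes.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import urlencode

# Hypothetical secret; in a self-hosted install it would come from your own key store.
SIGNING_KEY = b"replace-with-your-own-secret"


def _signature(blob_path: str, expires_at: int) -> str:
    # Sign both the path and the expiry so neither can be altered by the client.
    message = f"{blob_path}:{expires_at}".encode("utf-8")
    return hmac.new(SIGNING_KEY, message, hashlib.sha256).hexdigest()


def sign_blob_url(blob_path: str, ttl_seconds: int, now: Optional[float] = None) -> str:
    """Return a path plus query string that grants access until it expires."""
    current = time.time() if now is None else now
    expires_at = int(current) + ttl_seconds
    query = urlencode({"expires": expires_at, "sig": _signature(blob_path, expires_at)})
    return f"{blob_path}?{query}"


def verify_blob_url(blob_path: str, expires_at: int, sig: str,
                    now: Optional[float] = None) -> bool:
    """Reject expired links and links whose signature does not match."""
    current = time.time() if now is None else now
    if current > expires_at:
        return False
    return hmac.compare_digest(sig, _signature(blob_path, expires_at))
```

Because you hold the key in an installed setup, you also decide how long links live and when keys are rotated.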


Security benefits of an installed solution 


• You can set real security, including all keys, based on your real and
current security posture.

• You control network security, the annotations database, the raw data,
everything.

• You control the entire keychain.

• You are aware of the other threats and you can take action such as
pinning specific versions.

• You can usually inspect the source code.


User access 


One of the first things that comes to mind is often users’ abilities to access
samples. Consider a company that has a smart assistant device. Perhaps a
reviewer listens to audio data when the device misfired and the microphone
came on accidentally.

Or consider someone correcting a system to detect baby photos, etc. There
are many levels to consent.

On the consumer side there are generally these big buckets:


• Do not have consent to use directly (anonymized only).
• Do have consent to use to train models - may be limited by time.
• Have consent, but data may contain Personally Identifiable Information
(PII) that may be sensitive if included in the model.

On the commercial side, or more business-to-business type applications:


• May include confidential customer data. This commercial data may
potentially be “more valuable” than any single consumer record.

There may be government regulations such as HIPAA or other compliance
requirements.

What a mess, right?


Other day to day considerations that may come up: 


• Can annotators download the data to their local machine?
• Should an annotator be able to access records after completing
(submitting) them? Or are they locked out by default?

On the software side there are generally two major models that most
approaches fall into:


Task only availability 


This means that as an annotator, I can only see my currently assigned
task (or set of tasks).


Project level 


As an annotator I can see a set of tasks, or even multiple sets of tasks.

As a project administrator, the two big decisions are basically around:

• Structuring the data flow so that only data that is tagged as having
consent, and/or meets other PII requirements, ever enters the annotation
task flow at all.

• Deciding at what level annotators see tasks.
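The two administrator decisions above can be sketched in a few lines. This is a simplified illustration, not any specific tool’s API: the `Sample` and `Task` classes, the `Visibility` enum, and the rule that “meets PII requirements” means “contains no PII” are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Visibility(Enum):
    TASK_ONLY = "task_only"  # annotator sees only tasks assigned to them
    PROJECT = "project"      # annotator sees every task in the project


@dataclass(frozen=True)
class Sample:
    sample_id: str
    has_consent: bool
    contains_pii: bool


@dataclass(frozen=True)
class Task:
    task_id: str
    sample_id: str
    assignee: str


def admit_to_task_flow(samples):
    """Decision 1: only consented, PII-free samples ever become tasks."""
    return [s for s in samples if s.has_consent and not s.contains_pii]


def visible_tasks(tasks, annotator, visibility):
    """Decision 2: the level at which an annotator sees tasks."""
    if visibility is Visibility.PROJECT:
        return list(tasks)
    return [t for t in tasks if t.assignee == annotator]
```

Filtering before task creation means a sample without consent never reaches an annotator, regardless of which visibility level is chosen later.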


Data science access 


Naturally data science must access this data at some point to do work on it.
Often data science gets a fairly free hand to “look at” the data. A more strict
system may allow only sending a query and receiving a sample, with the
bulk of the data being sent directly to the training apparatus. This bypasses
the data scientist’s local machine or user specific server.

It’s worth considering that a single breach of a data scientist’s access is
often orders of magnitude more severe than a breach of an annotator’s. An
annotator, even if able to bypass various security mechanisms and store all
the data they see, may only see a small portion of the data of a large project,
whereas a data scientist may have hundreds of times more access.


Root level access 


A super admin type user, IT administrator, etc. may have some level of
root system access. This user may be classified as a super admin in the
application, have direct database access, etc.


Explainability *sidebar*
Regina Barzilay, a professor at MIT CSAIL, says:

“It’s like a dog, which can smell much better than us, explaining how
it can smell something. We just don’t have that capacity. I think that
as the machines become much more advanced, this is the big
question. What explanation would convince you if you on your own
cannot solve this task?”

The concept of explainability is important, but usually reserved more for
the machine learning model analysis side.


Open Source and Closed Source 


Open vs Closed source is an argument as old as time. I’d like to take a
moment to highlight some specifics I have seen relative to this training data
area.

Open and closed source annotation takes on a special consideration for the
rapidly changing training data landscape because the majority of this new
generation of tooling is closed source.


There have been many open source annotation tooling projects - some over
10 years old. However, in general most of those projects are no longer
maintained, or are very niche and not general purpose tools. Currently, the
two general purpose “second generation” annotation tools in open source
are Diffgram and LabelStudio. To be sure there are many other tools - but
most are focused on very specific considerations or applications.


Open source software has many advantages - especially in this privacy
focused area. You can see exactly what the source code is doing with your
data and be sure there is no nefarious activity afoot.

Open source does have some disadvantages. Most notably the initial setup
of the system itself may be more difficult (not the application setup itself,
which is similar in either case, just the actual installation of the overall
software).


The commercial costs of both open source and closed source may be
similar. Just because the code is open does not mean the license is
unlimited. Ease of use is often similar in the context of commercially
backed projects.

The costs of hosting open source are controlled by you. In general, the cost
of hosting is rolled into the cost paid to a commercial provider. This is a
nuanced tradeoff but in practice often is similar at small and medium scales.
At great volumes, usually the more control you have the better it is for you.

Open source may tend to have greater compatibility, since often there is
more use - from free users - who still run into issues and write tickets. This
can mean less technical risk.


Costs are also similar. Commercially backed open source projects usually
require an upgrade to a paid version at some point during commercial use.
Sometimes there may be an option to forgo paying, but at the very least that
means less support.

Choose an open source tool to get up and running quickly.

Some tools install in dev in minutes, and a moderate production setup takes
hours or a few days. Most have optional commercial licenses that can be
purchased. This is faster than talking to a sales team and gives a truer
account than limited SaaS trials.


See the forest for the trees

Environment setup and initial expectation is one of the hardest things. It’s
easy to over-fit on perceived setup/first impressions when much of the
value is delivered over a long period of time and to multiple user groups.


Capability over optimizations 


What is an optimization for some may be suboptimal for others. For
example, an extra “confirm” prompt when completing a task may feel like a
huge burden to some, while for others it’s a crucial step.

Consider that Excel has over 200 popular shortcuts. My guess is most users
only know a small fraction of them, yet are able to use Excel perfectly fine
for their jobs. Some people get very concerned about optimizations, like
specifics of hotkeys.


As annotation continues to become more complex, and new users enter the
picture, there is a shift away from shortcuts, and more to making sure the
UI has the capabilities, and shows a reasonable context so the user can
make use of those capabilities.

Ease of use in different flows

The ease of updating existing data is often much different from creating all
new annotations.


Vastly different assumptions 


I tried one popular annotation UI where the delete key deleted the entire
series across all video frames. This would be like painstakingly crafting an
entire spreadsheet, only to bump the delete key and have it delete the entire
sheet! Even though I was just testing, did I ever get a jolt when that
happened!

Of course, someone else could argue that it’s easier to use: I need only
select an object and click delete, and I don’t have to worry about the concept
of a series, or that it appears in multiple frames. Again - what’s right here
will depend on your use case. If you have complex per-frame attributes, a
single delete for what could be days of work is probably bad. Conversely, if
you have a simple instance type, in some cases perhaps it’s desired.

Again, customization by admins and users comes to the rescue. Do you
want to see the previous annotation in the next frame of the video? Or
not? Choose what’s right for you, set it and forget it.


Look at settings, not first impression 


Even seemingly simple things, like the font size, position, and background
of label tags, are all very dependent on the use case. For some, seeing any
label visually may get in the way. For others, the entire meaning is in the
attributes and not showing them slows down progress significantly.

The same goes for polygon size, vertex size, etc. One user may be unhappy
if the polygon points are hard to grab and move; another may want no
points at all so a segmentation line on a medical image can be perfectly
seen.

If there’s one enduring theme, it is to look less at the appearance of the
UI upon first glance, and more at what settings can be adjusted - or could be
added - to meet your needs.


Is it easy to use, or just lacking features? 


Another trade-off is that some vendors simply decline to enable features by
default, requiring each flow to be planned out. For example, this may mean
that instance types aren’t available in video, or settings may not exist, etc.
When evaluating, think hard about ongoing usage and the needs of more
complex scenarios, and whether the tool will handle them.


Customization is the name of the game 


There is increasingly an expectation of customization at all levels of the
software, from embedding interfaces, annotation settings, and admin
configurations to actually changing the software itself. Try to be aware of
what’s “hard” and what’s “easy” for your given provider.

For example, for a closed source provider, adding a new storage backend
may be a low priority. With an open source project you may be able to
contribute this yourself, or encourage others in the community to do so.
Also you may be able to better scope out and understand the impact of
changes and costs involved.

On the enterprise side of it, try to understand what the core of the software
really is for your use case. Is it a completely integrated platform? Is it the
data storage and access layers? Is it the workflow or annotation UI?
Because some of these tools vary so dramatically in scope and maturity, it
can be hard to compare. One tool may, for example, have a better spatial
annotation UI, but be substantially lacking in many other dimensions like
the ability to update data, ingest, query data, etc.


As a small story, a user noticed that when a task was already completed,
pushing a recently added ‘defer task’ button led to a poorly defined state in
the system. I agreed this was an issue. The fix was one line of code - a
single if statement.
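To give a feel for how small such a fix can be, here is a hypothetical sketch of a guard of this kind. It is not the actual code from any tool; the `Task` class, the status names, and `defer_task` are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Task:
    # Hypothetical states: "in_progress", "deferred", "complete"
    status: str = "in_progress"


def defer_task(task: Task) -> bool:
    """Defer a task unless it has already been completed."""
    if task.status == "complete":  # the single guard: completed tasks stay completed
        return False
    task.status = "deferred"
    return True
```

Without the guard, a completed task could be flipped to a deferred state, which is the kind of poorly defined state described above.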


On the other hand, if a vendor doesn’t offer major features like data
querying, streaming, wizard based ingestion, etc., those may all be
multi-month projects, multi-year epics, or even never be added at all.
Because this is a new area, with vastly different assumptions and
expectations, I really encourage you to first consider the major features,
and then look at the speed of updates and execution on improvements. A
vendor that can adapt quickly is especially valuable in this new area.

Another user had experienced, in a different UI, that deleted single points
were not “recoverable”, meaning that if, say, a hand was occluded on a
keypoint figure and I marked it that way, I couldn’t get it back when I went
to undo it. In Diffgram, the way the system was set up made it easy to
maintain this on a per point basis.


History 


Open Source Standards 


By my estimate, in 2017 there were likely fewer than 100 people in the world
working on commercially available tools for training data. In 2022 there are
over 1,500 people across at least 40 companies working directly for training
data focused companies. Sadly, the vast majority of individuals
work on separate projects in closed source software. Open Source projects
like Diffgram offer a bright future of shared access to Training Data tools
regardless of the financial status of the country you live in.

Open Source tooling also shatters illusions around what is magic and what
is standard. Imagine spending more budget for a database vendor that
promises 10x as fast queries, only to find out all they do is inject extra
indexes. Now in some cases that could have value, but you would at least
want to know upfront that you were paying for ease of use, not the concept of
indexes! Similarly, training data concepts like pre-labeling, interactive
annotation, streaming workflows, etc. are brought to the forefront. More on
this later in the chapter.


Realizing the Need for Dedicated Tooling 


As an industry, when we first started working with training data, the first
rush was to just “get it done” to start training models.

The questions were along the lines of “What was the most minimal bare
bones UI for a human to jam annotations on top of data and then into a
format a model could use?” This is when people first started to realize the
power of modern machine learning methods and just wanted to see “will
this work?” “Can it do that?” “Wow!”

The problems came swiftly. What happens when we move the project from
research to staging, or even production? What happens when the annotator
is not the same person writing the code, or even located in the same
country? What happens when there are hundreds or even thousands of
people annotating?

At this point, people often start to realize they need some form of dedicated
tooling. Early versions of training data tools answered some of these
questions, allowing remote work, some degree of workflow, and scale.
Quickly, however, more questions enter the picture as pressure on the system
increases.

More usage, more demands


To put this very plainly, the moment you have a significant number of
people spending their full working day, eight hours a day every day, in a
tool, everyone’s expectations and the pressure increase.

Iterative model development, for example pre-labeling, puts pressure on
teams to continually improve training data. While this is desirable, it puts
increased pressure on the tooling: the more often automation approaches are
used, the more pressure. Static pre-labels are just the tip of the iceberg.
Some automations require interactions, further stressing interactions
between data science, annotators, and annotation tooling.

Many features have been added to address these needs. As tooling providers
added more features, the ability to have a smooth workflow became a new
issue. Too many features, too many degrees of freedom. Now the
responsibility to limit the degrees of freedom has increased.


Advent of new standards 


Tooling providers have now had some years of experience and learned
many things, from the creation of new named concepts specific to training
data to multitudinous implementation details. These off-the-shelf tools make
the overwhelming into something manageable. This enables you to use these
new standards and work at a level of abstraction that’s relevant to you and
your project.


Yes, we are at the early stages of standard training data. As a community,
we are developing everything from conceptual ideas like Schema, to
expected annotation functions, to data formats. There is some agreement on
what the scope of training data tools is and what standard features are, but
there is still a way to go.

While there is naturally some overlap, most of the functional areas have
differences depending on the media type. For example, automations for
text, 3D, and images are all different.

The realization here is that bespoke Rube Goldberg machines may answer
some of the complexities but fail to cover the vast space needed. Putting
aside any historical interest, as someone making a decision today, the
context of the progression helps ground where the value is coming from.

I like to think of this as the 30,000 foot view. So if you are thinking about
an automation improvement, it’s worth reflecting on whether it will apply to
all of the media types that are relevant to you. It’s a reminder that any
weakness in one area is likely to create a bottleneck. If it’s difficult to get
the data in and out, the value of a great annotation workflow is diminished.


Suite 


Where are you in your journey of needs? Do you already see the need for
dedicated tools? For the best quality tools you can get? For a suite that
covers the vast training data space?

We all like familiar things. In the same way that office suites offer a similar
set of expectations and experiences, from the UI to naming conventions,
training data platforms aim to do the same: to create familiar experiences in
multiple formats, be it text or images.

Naturally at any given moment a single team may be focused on a specific
data type or types (multimodal). The familiarity here helps well beyond
this. New people joining the team can more quickly get up to speed, shared
resources can go between projects more easily, and more.

Generally the progression goes from:

• Realizing the need for dedicated tooling

• Realizing the complexity of the technical space requires the best possible
tooling - not just anything

• Realizing the complexity of the user space requires familiarity and
shared understanding


As I will explain in more detail in Chapter 7, if you are considering
establishing a director of training data position, having familiar tooling is
of critical importance to this team. The same annotator may easily shift
between multiple types of media and projects. This also helps address the
difference between data science concerns.

To head off a potential confusion: having a suite does not mean an “all
in one” solution for everything. Data science may have its own suite of
tools for training, serving, etc. It also does not preclude point solutions for
specific areas of interest. It’s more like an order of operations idea: we want
to start with the biggest operation, the main suite, and then supplement it
where required.


Summary 


You now have a roadmap and a general understanding of the process to set
up your Training Data system, from Installation, Annotation, and Embedding
through to ML Workflows and Optimizations. I provided a brief overview
of Training Data tools, and then discussed trade-offs and considerations in
more depth.

Do you feel comfortable that you have a grasp of the differences between
small, medium, and large projects? If not, I’d encourage you to review the
Trade-offs and scale section before reading on. Training data approaches
can vary quite dramatically depending on the project size, so it’s important
to frame your current learning goal there upfront.

Finally, as you can see from the history section, Training Data tools have
come a long way, and we continue to see improvements in standards.
Getting tools well set up is the practical implementation of thinking about
Schema, Raw Data, Quality, Integrations, and the Human Role.

Now that you have an overview of the setup process, available tools, and
trade-offs, it’s time to take a deep dive into Schema. What are labels and
attributes? What are spatial representations? How can we implement
Schema in our Training Data tools? What is the relation of Schema to ML
Tasks and Raw data? Find out in the next chapter.


https://linuxfoundation.org/wp-

Chapter 3. Schema


A NOTE FOR EARLY RELEASE READERS 
With Early Release ebooks, you get books in their earliest form—the 
author’s raw and unedited content as they write—so you can take advantage 


of these technologies long before the official release of these titles. 


This will be the 3rd chapter of the final book. Please note that the GitHub 


repo will be made active later on. 


If you have comments about how we might improve the content and/or 
examples in this book, or if you notice missing material within this chapter, 


please reach out to the editor at jleonard@oreilly.com. 


Schema Deep Dive Introduction 


What do you want your AI system to do? How will it accomplish it? What
methods are you going to use?

In this chapter, I dive into some of the foundational concepts around
Schema.

The real world is messy. Commercial applications require a level of detail
that’s hyper domain specific. There are many ways to structure this. In
general, these structures are defined in the Schema. Further, the Schema
provides “pivot points” to adapt and change sub-components over time to
better fit current needs.

The Schema is important to get right because the rest of the system,
including raw data, is defined in relation to the Schema.

Schema is the paradigm for encoding Who, What, Where, How and Why.
Schema is the overall representation of Labels, Attributes, and their
Relation to each other. It’s how we represent the meaning of what
something is, where it is, and more.

This builds on the high-level concepts of Labels and Attributes introduced
in Chapter 1. Afterward, I will map these training data concepts back to
Machine Learning tasks.

In this chapter, you will learn:


• Mental models to set up your first Schema.
• Overview of the landscape of expansion directions for your Schema.
• Trade-offs of popular methods and tasks.
• Specifics of some high-level ideas from Chapter 1.


Let’s dive into Schema! 


Labels and Attributes 


Structuring the where, what, and how. 


What Do We Care About? 


Generally, we care where something is, what it is, and how it relates to other
things.

Labels and Attributes are the tools we use to express “What” something is.
In the next section, I will introduce Spatial types to discuss Where
something is.

The concept of representing What something is can expand with near-infinite
complexity, whereas the spatial location aspects generally have
more obvious limits to their expansion. In other words, getting the What
right is a greater ongoing challenge than the mechanical specifics of Where
something is in a document or image.


Introduction to Labels 


Labels are the “top-level” semantic meaning. In the base cases, they may
represent only themselves, e.g. a label “car” may map to literally “car”. In
most cases, Labels organize a set of attributes.

For technical readers, to help ground this idea, I like to compare it to SQL
design.


Table 3-1. Comparison to SQL
Intuitive Concept    Training Data    SQL
Set                  Label            Table
Attribute            Attribute        Column


Of particular note here, Tables don’t usually have a “Type” where Columns
of course do. In this same sense, a Label doesn’t have a type whereas
Attributes do have types. Another way to think of a Label is like a folder or
set of attributes.

Interestingly, in E.F. Codd’s “The Relational Model for Database
Management”, he mentions that Columns were originally thought of as
Attributes.4 While far from a perfect analogy, it helps convey the general
idea. Continuing that line of thought, Attributes can be shared between
Labels, which is roughly analogous to Foreign keys.
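The analogy can be sketched in code. This is an illustration of the idea only, not a particular tool’s Schema format; the class names, field names, and type strings are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Attribute:
    """Like a SQL column: it has a name and a type."""
    name: str
    data_type: str  # illustrative type names such as "string" or "boolean"


@dataclass
class Label:
    """Like a SQL table: no type of its own, just a named set of attributes."""
    name: str
    attributes: list[Attribute] = field(default_factory=list)


# One Attribute shared by two Labels, roughly analogous to a foreign key.
occluded = Attribute(name="occluded", data_type="boolean")

car = Label(name="car", attributes=[occluded, Attribute("color", "string")])
person = Label(name="person", attributes=[occluded, Attribute("mood", "string")])
```

Here `car` and `person` carry no type themselves; all of the typing lives on the Attributes they organize.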


When an end user is annotating, organizing sets of attributes into labels
can also help hide irrelevant options. For videos, labels help constrain
relationships and organize sequences.

It’s expected that some of the specific organizational principles discussed
here will be implementation specific and change over time. In general, the
broad strokes will be similar. As this new area of Training Data continues to
be refined, standards will continue to evolve.

Next, we will talk about Attributes, which is where the bulk of the Schema
definition usually lives.


Attributes Introduction 


Attributes represent the bulk of the “What is it”: the heart of the human
encoded meaning and the technical definition. Attributes are usually
defined to include, at minimum, the following structures: the human
question or prompt, form type, and technical constraints. This set of human
and machine definitions together makes one “Attribute”.

Training Data Attributes may appear superficially simple or similar to other
technologies; however, in practice there is a lot of complexity at this
intersection of both human and machine-centered definitions.

In the same way that Training Data is a combination of raw data and
human-defined meaning, Attributes are a combination of technical
definitions and human centered definitions. In order to be useful for
Machine Learning, both the technical definition and the human definition
are needed. Attributes are the joint representation of those two things.


More technically speaking, Attributes can be thought of as well-defined
Forms or as “Data Classes meet UI specifications”. One way to wrap your
head around this is to think of a spectrum between Forms and Classes and
put Attributes somewhere on that spectrum. A Form can be arbitrarily
complex but usually isn’t thought of as having defined Types, like a Class.
Further, while the implementation of a Form may have validation, it’s
usually end user validation, not a formal database Constraint. Because the
Training Data is relied upon by the ML program, and is usually expected to
be queryable, Attributes have more “structure” than a typical Form.
Conversely, due to the expectations of Human control, Attributes usually
have a flair of “form like” behaviour, more than a typical programming
Class or database Table definition would have.

In practice, Attributes fill a need for Training Data that is distinctly different
from other technologies. As this area continues to evolve, I expect that
Attributes will continue to expand.
Typical attribute properties 
Usage 


A single annotation may have multiple Attributes. 


The question for humans 


A question for a human to consider, for example, “Is this person happy?”. A
human will be selecting these values and/or reviewing them. In that context,
each group will usually have additional information, such as a Prompt and a
Display order.


Form type 


Examples include Tree, Multiple Select, Select, Text, a child Group, and
Date. This straddles the line between User Interface Form types and Data
Types. For example, a Slider UI control could collect Float or Integer data.
In the most general case Attributes are often treated as Strings. Where
needed, both a UI type and a Data type can be stated.


Constraints (or bounds) 


There may also be Constraints on the form collection, for example, how many
options are allowed to be selected. Continuing the slider example, it would
have a lower and upper bound.


Predefined selection 


A predefined selection. Generally, an administrator defines valid values. 


Templates 


Usually Attributes are defined as some kind of Template. Values are unique 


to each Instance and may be concrete or a reference. 


Typically, any kind of number collection, free text, date, etc. would be
concrete, e.g. 3.14. When a known set is provided, say a Selection from a
list of 6 elements, a reference ID can be used instead.
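The properties above can be gathered into one template structure. The following is a minimal sketch, not any particular tool’s API: the class name `AttributeTemplate`, its field names, and the example values are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AttributeTemplate:
    """Hypothetical template joining the human and machine definitions."""
    prompt: str                       # the question for humans
    form_type: str                    # UI form type, e.g. "select", "slider"
    data_type: str = "string"         # data type of the collected value
    display_order: int = 0
    # Predefined selection: option ID -> option name (empty if not a selection).
    options: dict = field(default_factory=dict)
    # Constraints (bounds), e.g. for a slider.
    lower_bound: Optional[float] = None
    upper_bound: Optional[float] = None

    def validate(self, value):
        """Return True if a collected value satisfies the constraints."""
        if self.options:
            # For a predefined selection the value is a reference ID.
            return value in self.options
        if self.lower_bound is not None and value < self.lower_bound:
            return False
        if self.upper_bound is not None and value > self.upper_bound:
            return False
        return True


happy = AttributeTemplate(
    prompt="Is this person happy?",
    form_type="select",
    options={1: "Yes", 2: "No"},
)
occlusion = AttributeTemplate(
    prompt="How occluded is this?",
    form_type="slider",
    data_type="float",
    lower_bound=0.0,
    upper_bound=100.0,
)
```

Here the select stores a reference ID, while the slider stores a concrete value checked against its bounds.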


Examples of generic attributes 


• Occlusion (Blocked/Out of view) and Truncation (Out of frame)
• Depth / Hierarchy of Labels / meaning, e.g. what type of Action (Moving,
Jumping, Running)
• Directional Vector (e.g. front/back/side)


[Figure labels: “Occluded by Red Vehicle”; “Truncated (Out of Frame)”]


Figure 3-1. Example of Occlusion and Truncation 


To make this easier we use Constraints. We may constrain for example that 
a car can have a directional vector but a sidewalk does not. A disease may 
have multiple types, or we may constrain to only choose a single type. In 
the simplest case, if we imagine Labels “cat” and “dog” being the 2 choices, 


we are constrained to limit ourselves to only those two choices. 


Schema complexity trade-off 


The complexity of the Schema affects the ML program and human 
supervision. The overall Schema may be different from the supervision that 
is Shown to users at any given moment in time. In that sense, the ML 


predictions and the end-user correction may be decoupled. 


[Figure: Schema Complexity spectrum, from “Less Time, Less Complexity, Less
Items” to “Smarter, More Performance, More Items”]


Figure 3-2. Spectrum of Common Schema Trade-off Considerations 


The benefit of more Schema is often a “smarter” system. For example, if we
don’t have labeled data on “is offensive?” then we wouldn’t be able to train
a model to detect it. More labels also offer more insight into performance.


Systems tend to expand how many labels and attributes they have over time.
A good “back of the napkin” way to understand how complex the Schema and
its usage are is to multiply the number of Attributes by the number of
Annotations.


The Media type may affect the complexity trade-off greatly. For example, 


with images, if every object needs to be labeled, then the complexity of the 


Schema may be multiplied by the volume of objects per image. In a Video 
context this may be further multiplied by each frame. The more complex 


the Schema is, the more complexity for model training. 
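As a rough illustration of this multiplication, the sketch below computes the “back of the napkin” figure for an image and for a video. The function name and the example numbers are hypothetical, and the result is only a relative indicator, not a measured cost.

```python
def schema_complexity(num_attributes, annotations_per_image, num_frames=1):
    """Rough estimate: attributes x annotations, multiplied by frames for video."""
    return num_attributes * annotations_per_image * num_frames


# A single image with 12 labeled objects and 5 attributes each.
image_estimate = schema_complexity(num_attributes=5, annotations_per_image=12)

# The same scene as a 300-frame video, with every frame labeled.
video_estimate = schema_complexity(
    num_attributes=5, annotations_per_image=12, num_frames=300
)
```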


Attribute depth 


Imagine a grocery store checkout system. There may be many sizes of a
specific brand of cereal box, and we aim to identify the specific stock
number (SKU). In this context, we may wish to “database back” our
attributes, such that as inventory changes, the options presented change.
We may even load thousands or millions of options. Putting aside UI
challenges regarding searching and selecting each attribute, this is a
perfectly reasonable solution. At the time of writing there is no “right”
answer here. A system may easily have a handful of attributes, or
thousands.


Schema depth may also be affected by conditionals and complex 
hierarchies. At any point we can expand an attribute into a child node or a 
list of nodes. For example, selecting “No” could expand a “No - Options” 
node. An option for any given selection is to expand into a child node. 


Generally the same principles discussed here apply. 
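To make the child-node idea concrete, here is a small sketch of a nested schema in which an option either ends the question or expands into a child node. The dictionary layout, the function name `prompts_for`, and the example prompts are assumptions for illustration only.

```python
# Hypothetical nested schema: each option maps to None (no expansion)
# or to a child node with its own prompt and options.
schema = {
    "prompt": "Is the item in stock?",
    "options": {
        "Yes": None,
        "No": {
            "prompt": "No - Options",
            "options": {"Discontinued": None, "Back-ordered": None},
        },
    },
}


def prompts_for(schema_node, selections):
    """Follow the selections through the schema and return the prompts shown."""
    shown = []
    node = schema_node
    for choice in selections:
        shown.append(node["prompt"])
        node = node["options"][choice]
        if node is None:
            # This option does not expand any further.
            return shown
    if node is not None:
        # The last selection expanded a child node that is now displayed.
        shown.append(node["prompt"])
    return shown
```

Selecting “No” reveals the child prompt, while selecting “Yes” ends at the top-level question.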


Relationship to Spatial Types 


Imagine for our sports fan detector, we have a “Person” as the top level 
object. And then for detecting if they are wearing an offensive shirt or doing 


an offensive action, we have an “is_offensive” attribute. 


One consideration here is that if an ML Task like semantic segmentation is
chosen, and a polygon is used, do the attributes chosen correspond to
every pixel in the outline? Each instance generally has one spatial type. So
while we are thinking about the shirt, the actual data is for the whole
person.


This is a risk because, while we as humans know the offensive part is the
shirt, what we have actually encoded into the machine is that the whole area
is offensive. To help visualize this, consider that to the machine, these
things are functionally equivalent.


[Figure labels: Person; Is offensive? Yes]


Figure 3-3. Example of a risky schema for unwanted bias. The leftmost frame is an example of what 
is shown during annotation time. At the surface, it may appear reasonable. However, the center and 
right images are two examples of similar data, but with parts obscured. Since at train time, the entire 
image was declared to represent offensive, the ML program may predict based on just the human 
face, and not the t-shirt as may be assumed. 


The model may just as easily use a feature from the face, such as racial
features, instead of the t-shirt, creating the risk of it classifying
someone who has certain facial or ethnic properties as “offensive”. This
relationship is slightly more obvious with labels because they are
“top-level”, but can be more subtle in “buried” groups of attributes.


Consider that some systems (or even people) will use this type of training 
data without looking at the samples, or critically analyzing its Schema 
construction. To avoid this, we want to make sure the Schema encodes 
what’s actually present, as accurately as reasonably possible, and when 


possible that we record assumptions made around the system. 


One way to avoid spatial bias 


[Figure labels: Person, Is offensive? Yes (left); Person, Offensive (right)]


Figure 3-4. Using a Box to Avoid Spatial Bias (Right, Yellow) 


Imagine you were teaching a child about this. You would say something like
“That person is wearing an offensive shirt”, not “That person is offensive”.
While the latter may be true, it’s not an accurate description of the
logical situation as observed in the image.


One solution to this is to break out the label into two. Then the spatial 
location (eg here of the Yellow box) can better represent the label. The key 
point here is that we are trying to get the spatial location to be 


representative of the item of interest - and only the item of interest. 


Now perhaps in the future certain training methods will allow for this. 
However, we can relatively easily mitigate this risk today at the training 
data level simply by being aware of this and being more precise in how we 


lay out our classes. 


To further appreciate the importance, imagine an automatic alarm system 
that is monitoring for a {gun, knife, bomb} or some other real time threat 
(such as airport security). If the system is trained on images of people with 
guns but doesn’t separate the person from the gun, there is a risk of it 
falsely triggering when a person of a certain ethnic background is present 


(when no gun is!). 


If the spatial area annotated is of the true object we care about, in theory, all 
of the background data should not be relevant. In practice this may not be 
true however, so awareness of the overall data being used is still a joint 


responsibility. 


Joint responsibility 


If you are a data specialist you may think this doesn’t apply to you,
because of course this is a “Schema level” concern. Equally, if you are
designing the Schema, you may assume that the data specialist will notice
this and raise an alarm. Clearly, identifying when the Schema is at risk of
creating bad data is a joint responsibility.


Another reason to be aware of the actual data and data’s relationship to the 
Schema, is to create better quality data, which results in faster and better 


models. 


For example here, you may need a lot more data to identify offensive items 
if you are doing it at the person level. By scoping the spatial location to 
better match the object in question we improve the overall quality of the 


data. 


As mentioned earlier, this is not an ethics book. Here I outlined the direct
technical effects of choosing attributes and their related spatial locations.
There are many more trade-offs, and I encourage you to think of this only as
an introduction to new reasoning about Schema and who is responsible for it.


Importance of What It Is 


Earlier I mentioned briefly why “What” is important. To expand on this,
consider that there are oftentimes severe limits to the definition of
Where. In complex cases, multiple sensors can be combined to represent a
complex 3D view of Where. In Segmentation cases, each pixel is significant.
In each of these cases, though, there is a finite ability to declare spatial
locations. There are only so many pixels or voxels. Either there’s an object
there or not.


Or is it? 


What about partial objects? Often what we see for complex cases is that a
general localization is provided, such as for a sports player:


[ drawing of player with bounding box ]


IMAGE TO COME


Figure 3-5. Figure caption to come 


In this case, a player can be said to be “running”, or maybe “attacking”? 
What about identifying who the player is? Sure we can say the “limb” is in 


a certain position, but what does that really mean? 


Further, through a variety of assisted and pixel-based methods, localization
is, relatively speaking, easier to solve, whereas the “meaning” of some
concepts in life is the subject of philosophy books.


To be clear there can still be significant challenges localizing novel objects, 
but generally, for ongoing concerns, the “meaning” of what it is, becomes 


more central. 
The hidden background case 


When we think about where something is, it’s sometimes more useful to 
start with where something is not. For example, when we say it’s a traffic 
light at some point, what we are actually saying is 96% of the image is not a 


traffic light. This is Where conflated with What. 


In the case of object detection, this is often handled with an “objectness” 
score, and integrates with region proposal methods. The implementation of 
that is beyond the scope of this book. Where it relates to training data is that 
often training data is created without a 1:1 correlation to how it will be 


used. 


For example, it may be visually easier for a hierarchical “nested” list to be
displayed to a human. But in actual practice that may be implemented in the
network in many ways, such as by “flattening” the labels (red_occluded_20,
red_occluded_40), by combining multiple networks, or by using an
architecture that supports nesting.
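The “flattening” option can be sketched in a few lines. This is only an illustration under assumptions: the nested dictionary layout and the function name `flatten` are hypothetical, and a real pipeline would map the flat names to class indices for the network.

```python
def flatten(nested, prefix=""):
    """Turn a nested label/attribute hierarchy into flat class names."""
    flat = []
    if isinstance(nested, dict):
        # Inner node: descend, extending the prefix with this level's name.
        for key, child in nested.items():
            flat.extend(flatten(child, f"{prefix}{key}_"))
    else:
        # Leaf level: a list of option names.
        for leaf in nested:
            flat.append(f"{prefix}{leaf}")
    return flat


# What the human sees as a nested list...
nested_schema = {"red": {"occluded": ["20", "40"]}}

# ...and what a flat classifier head could be trained on.
flat_classes = flatten(nested_schema)
```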


Depending on the implementation, it’s possible to predict: 


• An object here


• Object classification is x


These methods may change in the future, and there is already a significant 


breadth of approaches. 


Example of sharing attributes between labels 


Earlier I introduced a Label as the highest level of semantic meaning 
defining “What” something is. For example a “Strawberry” or “Leaf”. 
Attributes are introduced as the breadth and depth of What. So how do 


these work together? 


Let’s imagine we are building a system to automatically detect what percent 
of fans at a sports match are cheering for one team over another. And 


perhaps it also needed to identify “Offensive” content. 


We may want to identify articles of clothing, such as t-shirts, pants, and
ballcaps. There are “What” representations like color, team logos, is it
offensive, etc. that are common to all of those items. There may also be
certain things that are only relevant to, say, t-shirts or ballcaps.


One way to structure this is to have t-shirts, pants, and ballcaps as the
Labels, and then to create an Attribute called “Color”, with various
properties such as “Red”, “Blue”, and “Green”. All of the labels can have
access to this attribute.


[Figure labels: Attribute; Relation; Label; Color; Which Sports Team?
(TeamA, TeamB, No Sports Logos)]


Figure 3-6. Relationship between Attributes and Labels. 


As a high-level example of the trade-offs to think about here: is it worth
it to identify color? Perhaps it’s common to wear a team’s colors, but a logo
may not be visible. So having a good idea of color may be useful for
further downstream processing to determine which team the person is
cheering for.


Technical Specifications 


The following explains technical specifics of defining and using Attributes. 
Technical example of attribute relation to an instance 


For example, a minimal group may be represented in the following form, 


described below. 


"instance_list": [
  {
    "type": "box",
    "attribute_groups": {
      id_of_group: {
        "id": id_of_attribute_selected,
        "kind": "select"
      }
    }
  },
  {... another instance }
]


In this example, there is a key instance_list, where each annotation
is one instance.


Each instance has a key attribute_groups, which holds the Attributes
defined above. So, for example, if there were 3 groups, there would be 3
keys in the attribute_groups dictionary. Each Attribute then has
information about its selected state, such as the IDs of the values
selected, or the raw characters of free text.


Representation, by reference vs by value 


This matters because your Schema will likely change over time. 


As shown below, some things in a Schema can be defined as references. For 
example, here, each option could be represented as a single radio select. 
However, for other types of attributes, like free text, the answer may be 


unique for each instance. 


Sometimes, for ease of change management, it’s useful to either “lock” the
schema or pass by value even when it’s possible to pass by reference. At a
very high level, it’s easier to think of labels as “slow-moving” concepts and
attributes as “fast-moving”, because it’s relatively easy to add and remove
attributes from labels, but changing labels may introduce breaking changes
to other attributes. For technical folks, the analogy of tables (labels) and
columns (attributes) in a database works well here.
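The difference between the two representations shows up when the Schema changes after annotation. In this sketch the `options` mapping, the `value` key, and the `display` helper are hypothetical names chosen for illustration.

```python
# Hypothetical Schema options, keyed by ID.
options = {7: "Red", 8: "Blue"}

by_reference = {"kind": "select", "id": 7}       # stores only the option ID
by_value = {"kind": "select", "value": "Red"}    # stores a copy of the text


def display(stored, options):
    """Return the human-readable answer for a stored attribute."""
    if "id" in stored:
        # By reference: look the current name up in the Schema.
        return options[stored["id"]]
    # By value: the answer travels with the instance.
    return stored["value"]


before = (display(by_reference, options), display(by_value, options))

options[7] = "Crimson"  # the Schema is edited after annotation

after = (display(by_reference, options), display(by_value, options))
```

The referenced answer follows the rename, while the value-stored answer keeps what the annotator originally saw. Which behavior is preferable depends on whether old annotations should track Schema edits.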


High level technical specs: 


Attributes represent the union between human defined meaning and 


machine learning readable data. 


In practice Training Data software has a number of implementation details 
to make that a reality. As a high level overview, here are some examples of 


common attributes, their high level type, and constraints. 


Table 3-2. Table of technical specifications of Attributes

| Attribute | Example | Output types | Store by reference or value | Constraints |
|---|---|---|---|---|
| Select (drop-down) or Radio Buttons (meaning all options shown) | (image) | String | Reference | Allow list |
| Multiple Select | (image) | List | References | Allow list |
| Free Entry | (image) | String, Integer | Value | Blocklist; Force type (String, Integer, Float etc); Force Range |
| Slider | (image) | | | |
| Date | | Date string | Value | Is date or ISO 8601 |
| Child Node | Expands another Attribute Group | String, e.g. unique hash; Integer, e.g. an Integer Primary Key | Reference | |


Attribute vs attribute group 


For ease of speaking I sometimes refer to an Attribute group like Color as
simply an “Attribute”. More technically, it is a set of attributes.


Technical example of attribute 


This assumes some kind of map of id_of_group to information about
the group is available, such as:


"attribute_groups_reference": [
  {
    "id": id_of_group,
    "options": [
      {
        "id": id_of_attribute_selected,
        "name": "option_name"
      }
    ],
    "prompt": "user defined prompt"
  }
]


[Illustration] E.g. similar to:


How Occluded is this?


(•) 0 - 20%
( ) 21 - 40%
( ) 41 - 60%
( ) 61 - 80%
( ) 81 - 100%


Figure 3-7. Figure caption to come 
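To show how an instance and the group reference fit together, the sketch below resolves an instance’s selected option back into the prompt and option name a human would see. The concrete IDs, the group-level `id` key, and the `describe` function are assumptions made for this example; the actual layout is implementation specific.

```python
attribute_groups_reference = [
    {
        "id": 101,  # hypothetical id_of_group
        "options": [
            {"id": 1, "name": "0 - 20%"},
            {"id": 2, "name": "21 - 40%"},
        ],
        "prompt": "How Occluded is this?",
    },
]

instance = {
    "type": "box",
    "attribute_groups": {101: {"id": 2, "kind": "select"}},
}


def describe(instance, reference):
    """Map each attribute group on an instance to prompt -> selected option name."""
    groups = {group["id"]: group for group in reference}
    answers = {}
    for group_id, selected in instance["attribute_groups"].items():
        group = groups[group_id]
        names = {option["id"]: option["name"] for option in group["options"]}
        answers[group["prompt"]] = names[selected["id"]]
    return answers


answers = describe(instance, attribute_groups_reference)
```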


Where Is It? - Spatial Representation 


Here I will introduce the common spatial types and concepts. Later we will
see how these relate to Machine Learning tasks and some common trade-offs
between them. Spatial representation is also known as Localization, Spatial
Locations, Spatial Types, Shapes, or Drawing Tools.


Computer Vision Spatial Types 
Three popular ones are Segmentation, Box, and Full Image.
Full image tag 


Full image tags lack a spatial location, which in some cases may limit their
usefulness. In cases where the Schema is very broad, tags can still be very
useful. However, if we don’t know where the item is, or there are multiple
items, the value is lower. In those cases, there is a faulty assumption that
the “item” of interest already fills the frame, which may not be realistic
or may overly constrain potential results.


Box 


The bounding box is one of the most “tried and true” methods of object 


detection. 


A box is defined by exactly 2 points, so only 2 pieces of information are
required to store it. For example, it can be defined either as the top-left
point (x, y) and a (width, height) pair, or as an (x_min, y_min) and
(x_max, y_max) pair of points.2 A box may also be rotated, in which case it
is defined by an origin and a rotation value.
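The two box representations carry the same information, so converting between them is simple arithmetic. A minimal sketch for axis-aligned (non-rotated) boxes, with hypothetical function names:

```python
def xywh_to_corners(x, y, width, height):
    """(top-left x, y, width, height) -> (x_min, y_min, x_max, y_max)."""
    return (x, y, x + width, y + height)


def corners_to_xywh(x_min, y_min, x_max, y_max):
    """(x_min, y_min, x_max, y_max) -> (top-left x, y, width, height)."""
    return (x_min, y_min, x_max - x_min, y_max - y_min)


corners = xywh_to_corners(10, 20, 30, 40)
round_trip = corners_to_xywh(*corners)
```

Because different tools and model libraries expect different conventions, it is worth recording which one a dataset uses.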


[Illustration of box example] 


IMAGE TO COME 


Figure 3-8. Figure caption to come 
Polygon 


It would be quite painful to literally label every pixel. So to get around this 
various higher level types are used. In this case, a Polygon may be drawn 
around the border. The Polygon is the Where (Spatial) and the Label “liver” 
is the What. A polygon is naturally defined by at least 3 points. There is no 
predefined limit to how many points a polygon may have. Often points are 
created via an assisted process, such as dragging the mouse or holding the 


shift key, or a “magic wand” type tool. 


[Illustration of polygon example] 


IMAGE TO COME 


Figure 3-9. Figure caption to come 


Continuing the segmentation example above, in the context of an image,
“per pixel” labels can be used. Here each green pixel is predicted as “liver”
on this CT scan:


[ diagram showing pixel level annotations on an image / segmentation
example / examples ]


IMAGE TO COME


Figure 3-10. Figure caption to come 


An example of a spatial location type is “per pixel”, or more commonly
“pixelwise”. It is the “highest” level of 2D localization.


Keypoint 


Keypoint is another type. The keypoint type is a graph with defined nodes 
(points) and edges. For example an anatomical model of a person, with 


fingers, arms, etc. defined. 


Ellipse and circle 


An ellipse is defined by a center point and a radius (x, y). These are used
to represent circular and oval objects. It functions similarly to a box, and
it can also be rotated.


Figure 3-11. Example of Ellipse 


Cuboid 


A cuboid has two “faces”. This is a projection of a 3D cuboid onto 2D 


space. Each face is essentially a “box”. 


Figure 3-12. Example of Cuboid 


Lines and Curves 
Straight line 


2 points 


Quadratic curves 


2 points and a control point. 




Figure 3-13. Example of Quadratic Curve in UI 
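Assuming the quadratic curve here is a quadratic Bézier curve (the usual meaning of “2 points and a control point”), any point along it can be computed from those three stored values. The function name below is hypothetical.

```python
def quadratic_point(start, control, end, t):
    """Point on a quadratic Bezier curve at parameter t in [0, 1]."""
    u = 1.0 - t
    # B(t) = (1 - t)^2 * start + 2 * (1 - t) * t * control + t^2 * end
    x = u * u * start[0] + 2 * u * t * control[0] + t * t * end[0]
    y = u * u * start[1] + 2 * u * t * control[1] + t * t * end[1]
    return (x, y)


start, control, end = (0.0, 0.0), (1.0, 2.0), (2.0, 0.0)
midpoint = quadratic_point(start, control, end, 0.5)
```

Only three points are stored, yet the whole curve can be redrawn or sampled at any resolution.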


Types with Multiple Uses 


Some types have multiple uses. A polygon can easily be used as a box by
finding its most extreme points. There is not exactly a “hierarchy”.
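Reducing a polygon to a box only requires the minimum and maximum coordinates. A short sketch, with a hypothetical function name and example points:

```python
def polygon_to_box(points):
    """Return (x_min, y_min, x_max, y_max) enclosing a list of (x, y) points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


box = polygon_to_box([(2, 1), (5, 3), (4, 7), (1, 4)])
```

The reverse is not possible in general, because the box has discarded the outline.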


Complex Spatial Types 


A Complex type refers to a spatial location defined by a set of multiple 
“primitive” types, such as Box and Polygon as defined above. This is 
similar, but different, to the concept called “Multi-Modal”. Generally Multi- 


Modal refers to multiple items of raw data. 


A common use case for complex types is complex polygons, such as a
polygon that is partially defined by segments of quadratic curves and
partially by “straight” segments between points.


Trade Offs with Types for Architecture and Creation 


Relevancy to concept 
In general the type must “match” the concept to some extent. For 


example, forcing an octagon shape for a car is unlikely to yield a benefit. 


Annotation time 
Of course, the more complex the type, the more time it takes. But often
the increase is step-like, e.g. segmentation may take 10x longer to
annotate than a box.


A comparison relevant to images and video is shown in Fig 2-10. Similar 
trade-offs exist as more precision is added in other media types. For 


example document > sentence > word > character. 


Trade Offs with Types for Usage 


• Compute cost
• Research completeness
• Complexity of output


[Figure: comparison of Classification, Object Detection, and Instance
Segmentation (example classes CAT, DOG, DUCK) on Compute Cost, Annotation
Effort, Location, Research completeness, and Complexity of output]


Figure 3-14. Trade off of different spatial types. 3 


When Is It? - Relationships, Sequences, Time Series


Sequences and Relationships 


Many of the most interesting cases have some form of relationship between 


Instances. 


Imagine a football (soccer) game. The ball is touched by a player. This
could be considered an “event” at a given frame, e.g. frame 5. In this case
we have two Instances, one of a Ball and one of a Player. Both of these are
in the same frame. They can be formally linked with an annotation
relationship between the two Instances, or can be assumed to be linked in
time because they occur in the same frame.


When 


Another big concept at play here is Relationships. Consider upgrading our 
still image example to a video context. From the human perspective we 
know the “liver” in frame 0 is the same organ as the one in frame 5, 10 etc. 
It’s a persistent object in our minds. This needs to be represented in some 
form, often called a Sequence or Series. Or it can be used to “ReID” an 
object in a more global context, such as the same person appearing in 


different contexts. 


[Illustration: Not exactly this, but something showing the same object.
(This text has interpolation, which is a separate concept.)]


Keyframe - 1 Interpolated frames 2 through 9 Keyframe - 10 


Figure 3-15. Interpolation Example 
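One simple way to fill the frames between two keyframes is linear interpolation of the box coordinates. The sketch below assumes boxes stored as (x_min, y_min, x_max, y_max) and linear motion between keyframes; real tools may use other interpolation schemes, and the function name is hypothetical.

```python
def interpolate_box(box_start, box_end, frame_start, frame_end, frame):
    """Linearly interpolate (x_min, y_min, x_max, y_max) between two keyframes."""
    if not frame_start <= frame <= frame_end:
        raise ValueError("frame must lie between the two keyframes")
    # Fraction of the way from the first keyframe to the second.
    t = (frame - frame_start) / (frame_end - frame_start)
    return tuple(a + (b - a) * t for a, b in zip(box_start, box_end))


keyframe_1 = (0.0, 0.0, 10.0, 10.0)       # drawn by a human at frame 1
keyframe_10 = (90.0, 45.0, 100.0, 55.0)   # drawn by a human at frame 10

# Frame 4 is one third of the way between the keyframes.
frame_4 = interpolate_box(keyframe_1, keyframe_10, 1, 10, 4)
```

Only the two keyframes are annotated by hand; frames 2 through 9 are derived.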


Guides, Instructions 


Of course, how do we know to label it “liver” at all? I sure wouldn’t be able
to figure that out without some form of guide. What about sections of the
liver that are occluded by the gallbladder or stomach?


[Figure labels: Liver; Left hepatic duct; Common hepatic duct; Stomach;
Duodenum (first part of small intestine); bile duct]


Figure 3-16. CAPTION 


In a Training Data context we formalize the concept of a Guide. In a strict 
sense, the Guide should be maintained with as much respect as the actual 
training data, because the “meaning” of the training data is defined by the 


Guide. In a sense Guides address the “How” and “Why”. 


We have covered some of the high level mechanics of representing Training
Data. While complex scenarios and constraints can be a challenge, often the
“real” challenge is in defining useful instructions. For example, the
nuScenes dataset has about a paragraph of text, bullet points, and 5+
examples for each top-level class.


In Fig 2-13 the example image shows the difference between a “bike rack” 
and a “bicycle”. This would be provided as part of a larger guide to 


annotators. 


Figure 3-17. Example of Clarifying Image provided in a Guide to Annotators 


Judgment Calls 


To see how some of this can quickly get dicey, consider “drivable surface” vs
“debris”. nuScenes defines debris as “Debris or movable object that is left
on the driveable surface that is too large to be driven over safely, e.g tree
branch, full trash bag etc.”4


Of course what’s safe to drive over in a semi-trailer vs a car is different. 
And this isn’t even getting into choice semantics like “should you drive 


over debris to avoid rear ending someone”. 


Choosing Good Names 


There’s a classic line about there being two hard things in computer science,
and naming being one of them.5 Later in the book I cover good and bad
training data concepts in more detail, but for now let’s just consider the
“architecture” of it.


We will introduce the key concepts and then dive deeper into the technical 


specifics of what that means. 


Historically Vision and natural language processing (NLP) were treated as 
very different tasks. In the new deep learning context there is much more 


overlap. 


Going back to our cats example, it’s tempting to think of training data as
straightforward, especially since in its early history it was generally an
outsourced task. Yet the complexity is expanding rapidly.


Relation of Machine Learning Tasks to Training Data


Training Data is used in a Machine Learning system. Therefore it’s natural 
to want to understand what ML Tasks are common and how they relate to 


Training Data. 


There is general community consensus on some of these groupings of tasks. 
There are many other resources that provide a deeper look at these tasks 
from the machine learning perspective. Here I will provide a brief 


introduction to each task from the Training Data perspective. 


Tasks 


Semantic segmentation 


Figure 3-18. CAPTION 


Every pixel is assigned a Label. 


An upgraded version of this is called “Instance Segmentation”, where 
multiple objects that would otherwise be assigned the same label are 
differentiated. For example if there are 3 people, each person would be 


identified as different. 


With Training Data, this can be done through “vector” methods (e.g.
polygons) or “raster” methods (picture a paint brush). At the time of writing
the trend seems to favor vector methods, but it is still an open question.
From a technical perspective vector is much more space efficient. Keep in
mind that the user interface representation here may differ from how it’s
stored. For example, a UI may present a “bucket” type cursor to select a
region, but still represent the region as a vector.


Another open question at the time of writing is how to actually use this
data. Some new approaches predict polygon points, whereas the “classic”
approach here is per pixel. If a polygon is used for training data and the ML
approach is classic, then the polygon must go through a process to be
converted into a “dense” mask. (This is just another way of saying a class
for each pixel.) The inverse is also true if a model predicts a dense mask but
the UI requires polygons for a user to more easily edit.
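As a rough illustration of converting a polygon into a dense mask, the sketch below tests the center of every pixel against the polygon using the even-odd (ray casting) rule. It is written in plain Python for clarity, not speed; production pipelines typically rely on an optimized image library, and the function names here are hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Even-odd rule: count polygon edges crossed by a ray going right from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def rasterize(polygon, width, height):
    """Return a height x width mask with 1 where the pixel center is inside."""
    return [
        [1 if point_in_polygon(col + 0.5, row + 0.5, polygon) else 0
         for col in range(width)]
        for row in range(height)
    ]


mask = rasterize([(1, 1), (4, 1), (4, 3), (1, 3)], width=5, height=4)
```

Four stored points expand into twenty mask values here, which hints at why vector storage is more space efficient.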


Note that if a model predicts vectors (polygons), then to fulfill the classic
definition of segmentation the vector must be rasterized, meaning converted
into a dense pixel mask. Depending on your use case this may not be
necessary. Keep in mind that while a per pixel mask may appear more
accurate, if a model based on a vector approach more accurately captures the
relevant aspects, that accuracy may be more illusory than real. This is
especially relevant to contexts that contain well documented curves that can
be modeled by a few points. For example, if your goal is to get a useful
curve, predicting the handful of values for a quadratic curve directly may
be more accurate than going from per pixel predictions back to a curve.


While there are many great resources available, one of the most prolific is
the site Papers With Code, which lists over 885 computer vision tasks, 312
NLP tasks, and 111 in other categories.6


Let’s walk through an example of a Medical use case. If one has a goal of 
doing Automatic Tumor Segmentation (eg of CT scans), Segmentation 
Training Data is needed.2 Researchers routinely note the importance of 
training data. “The success of semantic segmentation of medical images is 


contingent on the availability of high-quality labeled medical image data.”® 


Figure 3-19. Example of Tumor Segmentation a sub category of Segmentation” 


Training Data is created in conjunction with the Task goal, in this case,
automatic tumor segmentation. In the above example, each pixel
corresponds to a classification; this is known as “pixel-wise”. Each green
pixel is predicted as “liver” on this CT scan.


Image classification (tags) 


Whole image classification. An image may have many tags. While this is 


the most generic, consider that all the other methods are essentially built on 


top of this. From an annotation perspective this is one of the most 


straightforward. 


Object detection 


Detect spatial location of multiple objects and classify them. This is the 
classic “box drawing”. It’s generally the fastest spatial location annotation 
method and provides the big leap of getting spatial location. It’s a great 
“default choice” if you aren’t quite sure where to start because it’s often the 


best bang for your buck. 


While most of the research has centered around “boxes”, there is no 


requirement to do this. There can be many other shapes like ellipses etc. 


Pose estimation 


Figure 3-20. CAPTION 


At a high level this is “complex object detection”. Instead of a general
shape like a box, we try to get “keypoints” (graphs). There’s a graph-like
relationship between the points, such that, say, the left eye is within some
bounds of the right eye.


From a training data perspective, this is handled via the Keypoint template. 
For example, creating an 18 point representation of a human skeleton to 
indicate pose and orientation. This is different from segmentation/polygons 
because instead of drawing the outline, we are actually drawing “inside” the 


shape. 


Chart - Relationship of Tasks to Training Data Types 


There is not a one-to-one correlation between annotation data and machine 
learning tasks. However there is often a loose alignment, and some tasks are 
not possible without a certain level of spatial data. For example, you can 
abstract a more complex polygon to a box for object detection, but cannot 


easily go from box to segmentation. 


Table 3-3. Mapping of spatial types to machine learning tasks

| Training Data Spatial Type | Machine Learning Task Example |
|---|---|
| Polygon, Brush | Segmentation; Object Detection*; Classification |
| Box | Object Detection; Classification |
| Cuboid | Object Detection; 3D Projection; Pose Estimation; Classification |
| Tag | Classification |
| Keypoints | Pose Estimation; Classification |

* While it technically can be used for object detection, usually that’s a
secondary goal because it’s faster to do a box for the generic detection
base.


General Concepts 


The following concepts apply to, or interact with, both Spatial and What.


Instance Concept Refresher 


Virtually everything discussed here relates to an Instance (Annotation). An 
Instance represents a single sample. For example here each person is an 


Instance. 


An Instance has references to Labels and Attributes, and may also store 


concrete values, such as free text that’s specific to that Instance. 


In a video or multiple-frame context, an Instance uses an additional ID to
relate to other Instances in different frames. Each frame still has a unique
Instance because the data may be different. For example, a person may be
standing in one frame and sitting in another, but they are the same person.


To illustrate a subtle difference, consider that here all 3 people have the 
same “class”, but are different instances. If we didn’t have this Instance 


concept then we would get a result like the middle image. 


[Image showing something like this: ] 


Figure 3-21. Comparison of treating people as one group, vs as individual people (instances) 22 


Consider that for every Instance there can be n Attributes, which may in
turn have n children, and the children may have arbitrary types and
constraints (such as selection, free text, slider, date) of their own. In
fact, each Instance is almost like a mini graph of information. And if that
wasn’t enough, the spatial location may actually be 3D, and there may be a
series of frames.
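The Instance concept can be sketched as a small data structure. This is an illustrative sketch only: the class names, the field names, and the `sequence_id` used to relate Instances across frames are assumptions, not the schema of any specific tool.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class Attribute:
    name: str
    kind: str  # e.g. "select", "free_text", "slider", "date"
    value: Any = None
    children: List["Attribute"] = field(default_factory=list)


@dataclass
class Instance:
    label: str                         # reference to the Label, e.g. "person"
    frame: int                         # frame this Instance belongs to
    sequence_id: Optional[int] = None  # shared by the same object across frames
    attributes: List[Attribute] = field(default_factory=list)


def count_attributes(attributes: List[Attribute]) -> int:
    """Count attributes including all nested children."""
    return sum(1 + count_attributes(a.children) for a in attributes)


def same_object(a: Instance, b: Instance) -> bool:
    """Two Instances describe the same object if they share a sequence ID."""
    return a.sequence_id is not None and a.sequence_id == b.sequence_id


standing = Instance("person", frame=0, sequence_id=7,
                    attributes=[Attribute("posture", "select", "standing")])
sitting = Instance("person", frame=1, sequence_id=7,
                   attributes=[Attribute("posture", "select", "sitting",
                                         children=[Attribute("note", "free_text", "on a bench")])])
other = Instance("person", frame=1, sequence_id=8)

print(same_object(standing, sitting))        # True
print(same_object(sitting, other))           # False
print(count_attributes(sitting.attributes))  # 2
```

All three Instances share the same label, yet each frame keeps its own Instance with its own Attribute values.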


Modern software tools handle the relationships between these concepts 
well. The challenge is the data supervisor must, at least to some degree, 


understand the goal, so as to do a reasonable job. 


Further, the types must align with the use case in the neural network. 
Common network architectures have assumptions to be aware of when 


constructing training data. 


Upgrading Data Over Time 


In some cases it can be quite reasonable to “upgrade” data over time. For
example, a common pattern is:


• Run a “weak” classifier to tag images, purely with the intent of identifying
“good” training data.

• Humans create bounding boxes.

• At a later point in time, a specific class is identified as needing
segmentation. The existing bounding box then provides the rough
localization for an algorithm designed to “generate” segmentation data.


The Boundary Between Modeling and Training Data 


There is often a disconnect between the training data and how the machine 
learning model actually uses it. One example shown earlier is how the 
spatial locations do not have a one to one mapping to machine learning 


tasks. Some other examples include: 


• An ML model uses integers while humans see string labels.

• A model may ‘see’ a dense pixel mask (for segmentation), but humans
rarely if ever consider single pixels. Often segmentation masks are
actually drawn through polygon tools.

• ML research and Training Data advance independently of each other.

• The ‘where vs what’ problem: Humans often conflate where something
is and what it is, but these may be two distinct processes in an ML
program.

• The transformation of human-relevant Attributes for use in the model
may introduce errors. For example, humans may view a hierarchy, while
the model flattens that hierarchy.

• The ML program’s efforts at localization are different from a human’s.
For example, a human may draw the box using 2 points, while the machine
may predict from a center point,11 or use other methods that don’t have
anything to do with those 2 points drawn.12
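Two of these gaps can be sketched in a few lines: string labels becoming integers, and a box drawn with 2 points becoming a center-based representation. The label names, the sorted-order ID assignment, and the (center_x, center_y, width, height) layout are assumptions chosen for illustration. Real models differ in how they encode both.

```python
from typing import Dict, List, Tuple


def build_label_map(labels: List[str]) -> Dict[str, int]:
    """Assign each human-readable label a stable integer ID (sorted order)."""
    return {name: index for index, name in enumerate(sorted(set(labels)))}


def corners_to_center(x1: float, y1: float, x2: float, y2: float) -> Tuple[float, float, float, float]:
    """Convert a box drawn with 2 points into (center_x, center_y, width, height)."""
    x_min, x_max = sorted((x1, x2))
    y_min, y_max = sorted((y1, y2))
    width = x_max - x_min
    height = y_max - y_min
    return (x_min + width / 2, y_min + height / 2, width, height)


label_to_id = build_label_map(["person", "car", "tree"])
print(label_to_id)                        # {'car': 0, 'person': 1, 'tree': 2}
print(corners_to_center(10, 20, 30, 60))  # (20.0, 40.0, 20, 40)
```

Recording which conversion was used, alongside the data, is one way to capture the assumptions that separate the human view from the model view.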


Basically what this means is: 


• We must constantly remind ourselves that the ML modeling is different
from the Training Data: integers, not human labels; a different
localization process; different assumptions about where vs what; etc.
This difference is expected to widen over time as Training Data and
ML modeling continue to progress independently.

• Capture as many assumptions about how the training set was constructed
as possible. This is especially vital if any type of Assist Method is used
to construct the data.

• Ideally, whoever manages the training set should have some idea of
how the model actually uses the data. And whoever manages the model
should have some idea of how the set is managed.

• It’s crucial that Training Data can be updated in response to changing
needs. There is no such thing as a valid static dataset when the ML
programs interacting with the Training Data are changing.


Raw Data Concepts 


Raw data means the literal data, the images, videos, audio clips, text, 3D 
files, etc. being supervised. More technically it’s the raw bytes or BLOB 


that is attached to the human-defined meaning. 


The raw data is immutable whereas the human-defined meaning is mutable. 
We can choose a different schema, or choose how to label it etc., however, 
the raw data is what it is. In practice, the raw data usually undergoes some 
degree of processing. This processing generates new BLOB artifacts. The 
intent of this processing is usually to overcome implementation level 
details. For example, converting a video into a streamable HTML friendly 
format, or re-projecting coordinates of a GeoTIFF file to align with other 


layers. 


These processing steps are usually documented in the Training Data 
software. Metadata about the processing should also be available. For 
example, if a certain token parsing scheme was used in a pre-processing 
step, the machine learning program may need to know that. Usually this is 


not a major issue, but it’s something to be aware of. 


It’s best to keep data as near to the original format as possible. For example,
natively render a PDF rather than an image of a PDF, or native HTML rather
than a screenshot. This is complementary to the implementation details of
generating new artifacts. For example, a video may play back in a video
format and may additionally generate images of the frames as artifacts.


Accuracy is how close a given set of measurements are to their true value. 
Some ML programs have accuracy constraints that may be very different 
from Training Data constraints. For example, the resolution during 
annotation may differ from the resolution trained on. The main tool here is 
the metadata. As long as it’s clear what resolution or accuracy was present 
when the BLOB was supervised, the rest must be handled on a case by case 


basis. 
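As one sketch of how that metadata gets used, the function below rescales a box from the resolution present at annotation time to the resolution a model trains on. The class name, field names, and box layout are hypothetical. The point is only that the conversion is possible because the annotation-time resolution was recorded.

```python
from dataclasses import dataclass
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed layout


@dataclass(frozen=True)
class SupervisionMeta:
    """Resolution of the BLOB at the time a human annotated it."""
    width: int
    height: int


def rescale_box(box: Box, meta: SupervisionMeta, train_size: Tuple[int, int]) -> Box:
    """Map a box from annotation resolution to training resolution."""
    train_width, train_height = train_size
    x_min, y_min, x_max, y_max = box
    return (
        x_min * train_width / meta.width,
        y_min * train_height / meta.height,
        x_max * train_width / meta.width,
        y_max * train_height / meta.height,
    )


meta = SupervisionMeta(width=1920, height=1080)
scaled = rescale_box((960, 540, 1920, 1080), meta, train_size=(640, 360))
print(scaled)  # (320.0, 180.0, 640.0, 360.0)
```

Without the recorded 1920 by 1080 resolution, the same pixel coordinates would be ambiguous once the image is resized.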


Historically it was common for models to have fixed limits on resolution
(Accuracy); however, this continues to evolve. From the Training Data
perspective, the main responsibility is to ensure the ML program or ML
team knows the level of accuracy at the time of human supervision, and
whether there are any known assumptions around that Accuracy. For example,
a certain Attribute may require a minimum level of resolution to be
resolvable by a human.


Some BLOBs introduce extra challenges. A 3D point cloud may contain
less data than an image.12 However, 3D annotation is often more difficult
for an end user than image annotation. So while the technical data may be
less, the human supervision part may be harder.


Multiple BLOBs may be combined into a compound file. For example, 
multiple images could be related to each other as a single document. Or an 


image may be associated with a separate PDF. Conversely, a single BLOB 


may be split into multiple samples. For example, splitting a large scan into 


multiple parts. 


BLOBs may be transformed between different views. For example 
projecting a set of 2D images into a single 3D frame. Further, this 3D frame 
may then be annotated from a 2D view. And as a related concept, data may 
be labeled in different dimensional space than what it gets trained on. For 
example, the labels may be done in a 2D context, but trained in 3D space. 
The main thing to be aware of here is differentiating between what is 


human supervision, and what is a calculated or projected value. 


To summarize, there is a high level of complexity to the potential raw data 
(BLOB) formats and associated artifacts. The main responsibility of 
Training Data is to record processing steps done, record relevant 
assumptions present at human supervision time, and work with Data 
Science to ensure alignment between accuracy available at human 
supervision and usage in the ML program. By doing this we recognize the 
limits of assumptions we can make, and provide everything needed to make 


the best use of the data, and overall system success. 


Summary 


Next we will move from this high level viewpoint to that of a data 


specialist. How do we literally supervise and annotate this? 


https://dl.acm.org/doi/pdf/10.5555/77708 Pg. 43 Actual Pg.68 in PDF “In 
subsequent papers (e.g., Codd 1971a, 1971b, and 1974a), I realized the need 
to make this distinction, and introduced domains as declared data types, and 


attributes (now often called columns) as declared specific uses of domains.” 


https://medium.com/diffgram/how-do-i-design-a-visual-deep-learning-system-in-2019-8597aaa35d03


https://medium.com/diffgram/how-do-i-design-a-visual-deep-learning-system-in-2019-8597aaa35d03


https://github.com/nutonomy/nuscenes-devkit/blob/master/docs/instructions_nuscenes.md


predicting when to invalidate the cache is the other.


https://paperswithcode.com/sota Accessed Oct 4 2020


arXiv:1702.05970 [cs.CV] Automatic Liver and Tumor Segmentation of CT
and MRI Volumes using Cascaded Fully Convolutional Neural Networks -
Patrick Ferdinand Christ et al.


arXiv:1702.05970 figure 7


https://davidstutz.de/bottom-instance-segmentation-using-deep-higher-order-crfs-arnab-torr/


TODO I think there’s a specific center point paper we can reference


TODO be more specific, I’m thinking of SSD and how some of the
regression points work here


A point cloud stores a triplet of coordinates (x, y, z) for each point, and
there are often a few million of these points, whereas a single 4K image
contains over eight million RGB (Red Green Blue) triplets.


Chapter 4. Data Engineering 




This will be the 4th chapter of the final book. Please note that the GitHub 


repo will be made active later on. 




Introduction 


In earlier chapters, you were introduced to abstract concepts. Now, we’ll
move forward from that technical introduction to discuss implementation
details and more subjective choices. I’ll show you how we work with the art
of Training Data in practice as we walk through scaling to larger projects
and optimizing performance.


Data Ingestion is the first and one of the most important steps. At a high
level, this means getting your data into and out of your Training Data
Database. Why is Ingestion hard? Many reasons. For example, Training
Data is a relatively new concept; there are a variety of formatting and
communication challenges; the volume, variety, and velocity of data vary;
and there is a lack of well-established norms, leading to many ways to do it.


Also, there are many concepts, like using a Training Data Database and
who wants to access what and when, that may not be obvious, even to
experienced engineers. Ingestion decisions ultimately determine query,
access, and export considerations.


This chapter is organized into: 


• Who wants to use the data and when they want to use it

• Why the data formats & communications matter, think “game of
telephone”

• Introduction to a Training Data Database

• The technical basics of getting started

• Storage, media specific needs, and versioning

• Commercial concerns of formatting and mapping data

• Data Access, Security, and Pre-labeled data


To achieve a data-driven or data-centric approach, tooling, iteration, and
data are needed. The more iteration and the more data, the greater the need
for great organization to handle it.


You may ingest data, explore it, and annotate it in that order. Or perhaps 
you may go straight from ingesting to debugging a model. After streaming 
to training, you may ingest new predictions, then debug those, then use 
annotation workflow. The more you can lean on your Database to do the 


heavy lifting, the less you have to do yourself. 
Who Wants The Data? 


Before we dive into the challenges and the technical specifics, let’s set the 
table about goals and the humans involved here, and discuss how Data 
Engineering services those end users and systems. After, I’ll cover the 
conceptual reasons for wanting a Training Data Database. I’ll frame the 
need for this by showing what the default case looks like without it, and 


then what it looks like with it. 
For ease of discussion, we can divide this into groups: 


• Annotators and End Users

• Data science

• ML Programs (Machine to Machine)

• Application engineers

• Other stakeholders


Annotators and End Users 


End users need to be served the right data at the right time with the right
permissions. Often this is done at a single file level, and is driven by very
specifically scoped requests. There is an emphasis on permissions and
authorization. What is the “right time”? In general, it means on-demand
or online access. This is where the file is identified by a software
process, such as a task system, and served with fast response times.


Data Science 


Data Science most often looks at data at the set level. More emphasis is 
placed on query capabilities, the ability to handle large volumes of data, and 
formatting. Ideally, there is also the ability to drill down into specific 
samples and compare the results of different approaches both quantitatively 


and qualitatively. 


ML Programs 


ML Programs have needs similar to Data Science. Differences include
permissions schemes, and clarity on what gets surfaced and when. Often
ML programs can have a software-defined integration or automation.


Application Engineers 


Application Engineers are concerned with getting data from the application
to the Training Data Database and how to embed Annotation and
Supervision for end users. Queries per second (throughput) and volume of
data are often top concerns. There is sometimes a faulty assumption that
there is a linear flow of data from an “Ingestion” team, or the application, to
Data Science.


Other Stakeholders 


Other stakeholders include Security, DevMLOps, backup systems, etc. These
groups often have cross-domain concerns that crosscut other users’ and
systems’ needs. For example, Security cares about the end-user permissions
already mentioned. Security also cares about a single data scientist not
being a single point of critical failure, e.g. having the entire dataset on
their machine or overly broad access to remote sets.


Now that you have an overview of the groups involved, how do they talk 


with each other? How do they work together? 


A Game of Telephone 


Without knowledge of Data Engineering for Training Data, your end users and
systems will not work well with each other.


As an analogy, you can think of a Game of Telephone. Telephone is “a
game where you come up with a phrase and then you whisper it into the ear
of the person sitting next to you. Next, this person has to whisper what he or
she heard in the next person’s ear. This continues in the circle until the last
person has heard the phrase. Errors typically accumulate in the retellings, so
the statement announced by the last player differs significantly from that of
the first player, usually with amusing or humorous effect.”1
In Training Data, these accumulated errors are not humorous. Accumulated
errors cause poor performance, system degradation, and failures. These
errors are especially prevalent because it’s not a one-way, linear integration,
or even a two-way sync. Instead, it’s a graph-like, non-linear structure
between sensors, humans, models, and other data engineering systems.


Problems are especially prevalent in larger, multi-team contexts, and when
the holistic needs of all major groups are not taken into account. As a new
area with emerging standards, even seemingly simple things are poorly
defined. In other words, Data Engineering is especially important for
Training Data. If you have a greenfield project, then now is the perfect time
to plan your Data Engineering.


Whether you are planning a new project or rethinking existing projects, 


some signals that you need a new take on Data Engineering include: 


• Each team owns their own copy of the data.

• There is no true Training Data System of Record.

• Your System of Record does not holistically represent the state of
Training Data (e.g. thinking of it as just labeling rather than as the center
of gravity of the process).

• There is an unreasonable level of communication required to make
changes (e.g. changing the Schema).

• Changes cannot be done fluidly by end users but must be discussed as an
engineering-level change.

• Overall system performance is not meeting expectations or models are
slow to ship or update.


If each team owns its own copy of the data, there will be unnecessary 
communications and integration overhead. This is often the “original sin”, 
since the moment there are multiple teams doing this, it will take an 
engineering-level change to update the overall system, meaning updates 
will not be fluid, leading to the overall system performance not meeting 


expectations. 


Some attempts to avoid this include keeping a separate system as the
System of Record for the data instead of the Training Data System. This is
another case of multiple teams owning a copy of the data, with similar
problems. The data will keep getting garbled: while one tool may know
about xyz properties, the next tool does not import them, and it likely won’t
export all of the properties that it stores or imports. No matter how trivial
these issues seem on the whiteboard, they will always be a problem in the
real world.
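The garbling can be sketched with a toy chain of tools, each of which only keeps the fields it understands. The tool names and field names below are hypothetical and exist only to illustrate the effect.

```python
from typing import Any, Dict, Set


def pass_through(record: Dict[str, Any], supported_fields: Set[str]) -> Dict[str, Any]:
    """Simulate a tool that silently drops any field it does not support."""
    return {key: value for key, value in record.items() if key in supported_fields}


original = {
    "file": "scan_001.png",
    "label": "tumor",
    "box": (10, 20, 50, 80),
    "attributes": {"confidence": "high"},
    "annotator": "user_17",
}

# Hypothetical tools with different, partially overlapping field support.
TOOL_B_FIELDS = {"file", "label", "box", "attributes"}
TOOL_C_FIELDS = {"file", "label", "box", "annotator"}

after_b = pass_through(original, TOOL_B_FIELDS)  # "annotator" is dropped here
after_c = pass_through(after_b, TOOL_C_FIELDS)   # "attributes" is dropped here

lost = sorted(set(original) - set(after_c))
print(lost)  # ['annotator', 'attributes']
```

Tool C supports the annotator field, but the value was already lost one hop earlier, so the losses add up across the chain just as in the Game of Telephone.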


User expectations and data formats change frequently enough that they
resist an overly rigid automatic process. So don’t think about this in terms
of automation, but rather in terms of “where is the center of gravity of
Training Data?”. It should be with the humans and the Training Data
Database in order to get the best results.


Mental Model 


Figure 4-1. Example mental model of overall process. 


Planning A Great System 


So how do you avoid the Game of Telephone? It starts with planning. Here 
are a couple of thought starters, and then I’ll walk through best practices for 


the actual setup. 


The first is to establish a meaningful unit of work relevant to your business.
For example, for a company doing analytics on medical videos, it could be
a single medical procedure. Then, within each procedure, think about how
many models there are (don’t assume one!), how often they will be updated,
how the data will flow, etc. We will cover this in more depth in the system
design question, but for now, I just want to make sure it’s clear that
ingestion is often not a “once and done” thing. It’s something that will need
ongoing maintenance and likely expansion over time.


The second is to think about data storage and access, that is, a Training
Data Database. While it is possible to “roll your own”, it’s difficult to
holistically consider the needs of all of the groups. The more a Training
Data Database is used, the easier it is to manage the complexity. The more
independent storage is used, the more pressure there is to “reinvent the
wheel” of the database.


When it comes to building a great ingestion sub-system, the ideal is usually
that sensors feed directly into a training data system. Think about how
much distance, or how many hops, there are between {sensors, predictions,
raw data} and your training data tools.


Production data often needs to be reviewed by humans, analyzed at a set 
level, and potentially further “mined” for improvements. The more 


predictions, the more opportunity for further system correction. How will 


production data get to the training data system in a useful way? How many 


times is data duplicated during the tooling processes? 


What are our assumptions around the distinctions between various uses of
the data? For example, querying the data within a training data tool scales
better than expecting a data scientist to export all the data and then query it
themselves afterward.
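The difference between the two access patterns can be sketched with Python’s built-in sqlite3 module standing in for a training data store. The table layout and column names are assumptions for illustration, since a real Training Data Database exposes its own query interface.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE annotation (file_id INTEGER, label TEXT, reviewed INTEGER)")
rows = [
    (i, "person" if i % 10 == 0 else "car", 1 if i % 3 == 0 else 0)
    for i in range(1000)
]
conn.executemany("INSERT INTO annotation VALUES (?, ?, ?)", rows)

# Pattern 1: export everything, then filter on the data scientist's machine.
exported = conn.execute(
    "SELECT file_id, label, reviewed FROM annotation ORDER BY file_id"
).fetchall()
filtered_locally = [row for row in exported if row[1] == "person" and row[2] == 0]

# Pattern 2: let the database do the filtering and transfer only the matches.
queried = conn.execute(
    "SELECT file_id, label, reviewed FROM annotation "
    "WHERE label = ? AND reviewed = 0 ORDER BY file_id",
    ("person",),
).fetchall()

print(len(exported), len(queried))
conn.close()
```

Both patterns produce the same rows, but the first moves every record across the wire before any of them is discarded. That cost grows with the size of the set, not with the size of the answer.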


Naive & Training Data Centric approaches 
There are generally two major concepts to be aware of: 


Naive approaches 
Tend to see Training Data as just one step to be bolted alongside a series 


of existing steps. 


Training Data centric approaches 

See Training Data, the human supervision, as the “Center of Gravity” of
the system. For example, a Training Data Database has the definitions,
and/or literal storage, of the raw data, annotations, Schema, and mapping
for machine-to-machine access.


There is naturally some overlap in approaches. In general, the greater the 
competency of the naive approach, the more it starts to look like a re- 


creation of a Training Data Centric one. While it’s possible to achieve 


desirable results using other approaches, it is much easier to consistently 


achieve desirable results with a Training Data Centric approach. 


Let’s get started by looking at how naive approaches typically work. 
Naive Approaches 


Typically in a naive approach, sensors capture, store, and query the data
independently of the training data tooling, as shown in Figure 4-2. This
usually ends up looking like a linear process, with pre-established start and
end conditions.


[Diagram labels: Data Bucket, Human Reviews Data, Training Data, Manual Export]


Figure 4-2. Naive Data Engineering Process Example 


The most common reasons Naive approaches get used: 


• The project started before Training Data Centric approaches had mature
tooling support

• Engineers did not know about Training Data Centric approaches

• Testing and development of new systems

• Old, historical data, with no chance of new data coming in (rare)

• Cases where it’s impractical to use a Training Data Centric approach
(rare)


Naive approaches tend to look like the Game of Telephone mentioned 
earlier. Since each team has their own copy of the data, errors accumulate as 
that data gets passed around. Since there is either no System of Record, or 
the System of Record does not contain the complete state of Training Data, 
it is difficult to make changes at the user level. Overall the harder it is to 
safely make changes, the slower it is to ship and iterate, and the worse the 


overall results are. 


Additionally, naive approaches tend to be coupled to hidden or undefined
human processes. For example, some engineer somewhere has a script on
their local machine that does some critical part of the overall flow, but that
script is not documented or not accessible to others in a reasonable way.
This often happens because of a lack of understanding of how to use a
Training Data Database, as opposed to being a purposefully decided process.


In naive approaches there is a greater chance of data being unnecessarily
duplicated. This increases hardware costs, such as storage and network
bandwidth, in addition to the already mentioned conceptual bottlenecks
between teams. It can also introduce security issues, since the various copies
may maintain different security postures. For example, a team may correlate
data and bypass the security of a system earlier in the processing chain.


A major assumption in naive approaches is that a human is manually 
reviewing the data so that only the data desired to be annotated is imported. 
In other words, only data designated for annotation is imported. This 
“human before import” assumption makes it hard to effectively supervise 
production data, use exploration methods, and generally bottlenecks the 
process because of the manual and undefined nature of the hidden curation 


process. 


Please think of this conceptually, not in terms of literal automation. A
software-defined ingestion process is by itself little indication of overall
system health, since it does not speak to any of the real architectural
concerns around the usage of a Training Data Database.


Training Data Centric 


Another option is to use a Training Data Centric approach. A Training Data
Database, as shown in Figure 4-3, is the heart of a Training Data Centric
approach.


Figure 4-3. Training Data Database


A Training Data Database has the definitions, and/or literal storage, of the 
raw data, annotations, Schema, and mapping for machine-to-machine 
access, and more. Ideally, it’s the complete definition of the system, 
meaning that given the Training Data Database, you could reproduce the 


entire ML process with no additional work. 


When you use a Training Data Database as your System of Record, there
is a central place for all teams to store and access Training Data. The
greater the usage of the Database, the better the overall results, similar to
how proper use of a database in a traditional application is well known to be
essential.


The most common reasons to use this approach: 


• It supports a shift to data-centric machine learning. This means focusing
on improving overall performance by improving the data instead of just
improving the model algorithm.

• It supports using multiple ML programs in one process, by having the
definitions all in one place.

• It supports end users supervising (annotating) data, and supports
embedding end-user supervision deeper in workflows and applications.


Other reasons include:


• It decouples visual UI requirements from data modeling (e.g. query
mechanisms).

• It enables faster access to new tools for data discovery, what to label, and
more.

• It enables user-defined file types, e.g. representing an “Interaction” as a
set of images and text, supporting fluid iteration and end user-driven
changes.

• It avoids data duplication, and stores external mapping definitions and
relationships in one place.

• It unblocks teams to work as fast as they can instead of waiting for
discrete stages.


The problems with a Training Data Database approach:


• It requires knowledge of its existence and a conceptual understanding of it.

• It requires the time, ability, and resources to use a Training Data Database.

• Well-established data access patterns may require reworking for the new
context.

• Its reliability as a System of Record does not have the same history as
older systems have.


Ideally, instead of deciding what raw data to send (e.g. from an Application 
to the Database), all relevant data gets sent to the Database. This means 
swapping the order to put all the data in the Database first. This helps to 
ensure it is the true system of record. For example, there may be a “what to 
label” program that uses all of the data, even if the humans only review a 
sample of it. Having all the data available to the “what to label” program 
through the Training Data Database makes this easy. The easiest way to 
remember this is to think of the Training Data Database as the center of 


gravity of the process. 


The Training Data Database takes on the role of managing references to, or
even containing the literal byte storage of, raw media. This means sending
the data directly to the training data tooling. In actual implementation, there
could be a series of processing steps, but the idea is that the final resting
place of the data, the source of truth, is inside the Training Data Database.


A Training Data Database is the complete definition of the system, meaning 
that given the Training Data Database, you can create, manage, and 
reproduce all end user ML application needs, including ML processes, with 


no additional work. This goes beyond MLOps which often focuses more on 


automation, modeling, and reproducibility purely on the ML side. You can 
think of MLOps as a sub concern of a more strategic Training Data Centric 


approach. 


A Training Data Database considers multiple users from day one and plans
accordingly. For example, an application designed to support data
exploration can automatically create indexes for data discovery at ingest.
When it saves annotations, it can create indexes for discovery, for
streaming, for security, etc., all at the same time.


A closely related theme is that of exporting the data to other tools. Perhaps
you want to run some process to explore the data, and then need to send it
to another tool for a security process, such as to blur personally identifiable
info. Then you need to send it on to some other firm to annotate, get the
results back from them into your models, etc. At each of these steps, there is
always a mapping (definitions) problem. Tool A outputs in a different
format than Tool B inputs. I have talked a bit about data scale before, but as
a quick reminder, this type of data transfer is often an order of magnitude
larger than in other common systems. The best rule of thumb I
can think of is that each transfer is more like a mini-database migration.
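One way to picture the mapping problem is to keep every tool’s field mapping in a single place and refuse to export anything that has no mapping. This is a sketch under assumed names: `tool_a`, `tool_b`, and all field names are hypothetical, not the formats of real products.

```python
from typing import Any, Dict

# Central mapping definitions: internal field name -> external tool's field name.
SCHEMA_MAPPINGS: Dict[str, Dict[str, str]] = {
    "tool_a": {"file": "image_path", "label": "class_name", "box": "bbox"},
    "tool_b": {"file": "uri", "label": "category", "box": "rect"},
}


def to_external(record: Dict[str, Any], tool: str) -> Dict[str, Any]:
    """Translate an internal record to a tool's format, failing loudly on gaps."""
    mapping = SCHEMA_MAPPINGS[tool]
    unmapped = set(record) - set(mapping)
    if unmapped:
        raise KeyError(f"no mapping for {sorted(unmapped)} in {tool}")
    return {mapping[key]: value for key, value in record.items()}


def from_external(record: Dict[str, Any], tool: str) -> Dict[str, Any]:
    """Translate a tool's record back to the internal format."""
    inverse = {external: internal for internal, external in SCHEMA_MAPPINGS[tool].items()}
    return {inverse[key]: value for key, value in record.items()}


internal = {"file": "frame_0001.jpg", "label": "person", "box": (10, 20, 50, 80)}
for_tool_a = to_external(internal, "tool_a")
round_trip = from_external(to_external(internal, "tool_b"), "tool_b")
print(for_tool_a)
print(round_trip == internal)  # True
```

Raising an error on an unmapped field is a deliberate choice in this sketch: a loud failure at export time is easier to fix than a property that silently disappears between tools.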


Generally speaking, the tighter the connection between the sensors and the 
training data tools, the more potential for all of the end users and tools to be 
effective together. Every other step that’s added between sensors and the 


tools is virtually guaranteed to be a bottleneck. The data can still be backed 


up to some other service at the same time, but generally, this means 


organizing the data in the training data tooling from day one. 


The First Steps 


Let’s say you are on board to use a Training Data Centric approach. How do 


you actually get started? 
The first steps are to:


1. Set up Training Data Database Definitions


2. Set up Data Ingestion

Let’s consider definitions first. A Training Data Database puts all the data 
in one place, including mapping definitions to other systems. This means 
that there is one single place for the System of Record and associated 
mapping definitions to modules running within the Training Data System 
and external to it. This reduces mapping errors, data transfer needs, and data 


duplication. 


Before we start actually ingesting data, here are a few more terms that need 


to be covered first. 


• Raw Data Storage

• Raw Media BLOB Specific Concerns

• Formatting and Mapping

• Data Access

• Security Concerns


Let’s start with Raw Data Storage. 


Raw Data Storage 


At a high level the goal is to get the raw data, such as images, video, and 


text, into a usable form for Training Data work. 


It is common to store raw data in a Bucket abstraction. This can be on the 
cloud or on local storage using software like MinIO. Some people like to 
think of these cloud buckets as “dump it and forget”, but there are actually a 
lot of performance tuning options available. At the Training Data scale raw 
storage choices matter. If you are a storage guru then feel free to skip this 


subsection. 


Storage Class 

There are larger differences between storage tiers than may first
appear. The tradeoffs include access time, redundancy, and geo-availability,
and prices can differ by orders of magnitude between the tiers.
The key tool to be aware of is Lifecycle rules: usually, with a few
clicks, you can set policies to automatically move old files to cheaper
storage options as they age. Examples of best practices in more granular
detail can be found here.
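
To make Lifecycle rules more concrete, here is a minimal sketch that builds a rule in the shape Amazon S3 uses for lifecycle configurations. The rule ID, the prefix, and the day thresholds are assumptions chosen for illustration, and other providers have their own equivalents.

```python
def build_lifecycle_policy(prefix, warm_after_days=30, cold_after_days=180):
    """Build an S3-style lifecycle configuration as a plain dict.

    Objects under `prefix` move to an infrequent-access tier and later to
    an archive tier as they age. All thresholds here are illustrative.
    """
    if cold_after_days <= warm_after_days:
        raise ValueError("cold_after_days must be greater than warm_after_days")
    return {
        "Rules": [
            {
                "ID": "age-out-raw-media",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": warm_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": cold_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


policy = build_lifecycle_policy("raw/video/")
print(policy["Rules"][0]["Transitions"])
```

With boto3, a dict of this shape is what `put_bucket_lifecycle_configuration` accepts as its `LifecycleConfiguration` argument. The same policy can usually be set from the provider's console instead.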


GeoLocation (AKA Zone, Region) 

Are you storing data on one side of the Atlantic Ocean and having 
annotators access it on the other? It’s worth considering where the actual 
annotation is expected to happen, and if there are options to store the 


data closer to it. 


Vendor Support 

Not all annotation tools have the same degree of support for all major
vendors. Keep in mind that you can typically integrate any of
these offerings manually, but this requires more effort than using tools that
have native integration.


Support for accessing data from these storage providers is different from the
tool running on that provider. Some tools may support access from all three
major clouds (Amazon, Google, and Azure), but as a service, the tool itself
runs on a single cloud. If you have a system you install on your own cloud,
usually the tool will support all three.

For example, you may choose to install the system on Azure. You may then
pull data into the tool from Azure, which leads to better performance.
However, that doesn’t prevent you from pulling data from Amazon and
Google as needed.


By Reference or by Value 


If you want to maintain your folder structure, some tools support referencing
the files instead of actually transferring them. The benefit of this is less data
transfer. A downside is that it’s now possible to have broken links. Separation
of concerns could also be an issue: for example, some other process
may modify a file that the annotation tool expects to be able to access.

Even when you use a pass-by-reference approach for the raw data, it’s
crucial that the system of truth is the Training Data Database. For example,
data may be organized into sets in the Database that are not represented in
the bucket organization. The bucket also represents only the raw data,
whereas the Database will have the annotations and other associated
definitions.


For simplicity, it’s best to think of the Training Data Database as one 
abstraction, even if the raw data is stored outside the physical hardware of 


the Database. 
Figure 4-4. How References to Raw Blobs Intersect a Training Data Database’s other functions.


Off-the-shelf dedicated Training Data tooling on your own hardware


In this context, we trust the Training Data tooling to manage the signed URL
process and to handle the IAM concerns.
With this, assuming we trust the tooling (perhaps because we can inspect
the source code), the only real concern is what bucket it uses, and that
generally becomes a one-time concern, because the tool manages the IAM.
For advanced cases, the tool can still link up with a single sign-on (SSO) or
more complex IAM scheme.

The tooling doesn’t have to run on your hardware. The next level up is to trust the
Training Data tooling and also trust the service provider to host and process the data. At
that point most of this discussion is less relevant, because you are really just following
the provider’s instructions; however, it also gives you the least control.


Data storage 


Data can be used by reference or by physically moving it.

Where does the data rest?

Generally speaking, any form of tooling will generate some form of data
that is added to the “original” data.

This means that if I have data in bucket A and I use a tool to process it, there
will either be additional data in bucket A or I will need a new bucket B to be
used by the tool. This is true for Diffgram, SageMaker, and, as far as I’m
aware, most major tools.


Depending on your cost and performance goals, this may be a critical issue
or of little concern. On a practical level, for most use cases:

• Expect that additional data will be generated.

• Know where the data is being stored, but don’t overthink it.

In the same way that we don’t really question how much storage, say, a Postgres
Write Ahead Log (WAL) generates, my personal opinion is that it’s best to
trust the training data tool in this regard. If there’s an issue, address it
within the training data tool’s realm of abstraction.


Bucket connection 


With all of the above in mind, the best abstraction is to create a connection from the
training data tooling to the bucket. There are various opinions on how to do
this, and the specifics vary by hardware and cloud provider.


[ insert example from diffgram UI of file browser] 


[ insert example of solution schema with training data tool and bucket] 


For the non-technical user, this is essentially “logging in” to a bucket. In
other words, I create a new bucket A and get a user ID and password
(client ID / client secret) for it from the IAM system. I pass those credentials to
the Training Data tool and it stores them securely. Then, when needed, the
Training Data tool uses those credentials to interact with the bucket.


Raw Media (BLOB) Type Specific 


A BLOB is a Binary Large Object. It is the raw data of media. Depending
on your media type, there will be specific concerns. Each media type will
have vastly more concerns than can be listed here, so you must research each
type for your needs. Here I note some of the most common considerations
to be aware of. A Training Data Database will help format BLOBs to be
useful for multiple end users, such as Annotators and Data Scientists.

Images


Images usually don’t require any technical data prep because they are 
individually small enough files. More complex formats like TIF will often 


be “flattened”, although it’s possible to preserve layers in some tools. 


Video 


It’s common to split video files into smaller clips for easier annotation and
processing. For example, splitting a 10-minute video into 60-second clips.

Sampling frames is one approach to reduce processing overhead. This can
be done by reducing the frame rate and extracting frames. For example,
converting a 30 frames per second (FPS) video to 5 or 10 FPS. The drawback is
that it makes it more difficult to annotate or use other relevant video features like
interpolation or tracking. Usually it’s best to keep the video as a playable
video and also extract all the frames required. This improves the end user
annotation experience and maximizes annotation ability.

Event-focused analytics need the exact frame when something happens,
which is effectively lost if many frames are removed. Further, when the
full data is kept, it is available to be sampled by “find interesting highlights”
algorithms. This leads Annotators to see more “interesting” things
happening, and to higher quality data. Object tracking and interpolation
drive this point home further: an annotator may only need to label a handful of
frames and often gets many back “for free” through those algorithms. And
while nearby frames are generally similar, in practice it often still helps to
have the extra data.


An exception to this is that sometimes very high FPS video (e.g. 240-480+)
may still need to be sampled down to 120 FPS or similar. Note that even when
many frames are available to be annotated, we can still choose to
only train models on completed videos, completed frames, etc. If you must
downsample the frames, use the global reference frame to maintain the
mapping of the downsampled frame to the original frame.
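
As a minimal sketch of keeping that mapping, the function below records, for each downsampled frame, the frame number it came from in the original video. The function name and the assumption of integer frame rates are mine, chosen for illustration; your tool may store the global frame number differently.

```python
def kept_global_frames(original_fps, target_fps, total_frames):
    """Return the original (global) frame number for each downsampled frame.

    Index i of the result is the frame number in the downsampled video;
    the value at that index is the frame number in the original video.
    """
    if not 0 < target_fps <= original_fps:
        raise ValueError("target_fps must be positive and at most original_fps")
    kept = []
    i = 0
    # Integer arithmetic avoids floating point drift on long videos.
    while (i * original_fps) // target_fps < total_frames:
        kept.append((i * original_fps) // target_fps)
        i += 1
    return kept


# 30 FPS sampled down to 10 FPS keeps every third original frame.
mapping = kept_global_frames(original_fps=30, target_fps=10, total_frames=12)
print(mapping)  # [0, 3, 6, 9]
```

An annotation made on downsampled frame 2 can then be written back against original frame `mapping[2]`, so event timing is not lost.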


3D 


Usually you will need to transmit each file’s series of x, y, z triples to the
SDK.


Text 


Tokenizer 
You will need to select your desired tokenizer or confirm that the tokenizer
used by the system meets your needs. The tokenizer divides text up into words,
for example based on spaces or on more complex algorithms.
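
To illustrate why the choice matters, here is a small sketch comparing a space-based tokenizer with a slightly more complex one that separates punctuation. Both functions are illustrative stand-ins written for this example, not the tokenizer of any particular Training Data tool.

```python
import re


def whitespace_tokenize(text):
    """Split on runs of whitespace only; punctuation stays attached to words."""
    return text.split()


def word_punct_tokenize(text):
    """Split into runs of word characters and individual punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


sample = "Hello, world!"
print(whitespace_tokenize(sample))  # ['Hello,', 'world!']
print(word_punct_tokenize(sample))  # ['Hello', ',', 'world', '!']
```

Because annotations on text usually refer to token positions, switching tokenizers after annotation has started can shift those positions, which is one reason to settle this early.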
Medical 
If your specific medical file is not supported by the tool, you may need to:

1. Downsample color channels

2. Select which z-axis or slice you wish to use

3. Crop out images from a too-large single image


Geospatial 


GeoTIFF (and Cloud Optimized GeoTIFF, COG) is a standard format. Be aware that a
projection mapping change may be needed to standardize layers.


Formatting & Mapping 


Raw media is one part. Annotations and predictions are another big part.
Think in terms of setting up data definitions rather than a one-time import.
The better your Definitions, the more easily data can flow between ML
applications, and the more end users can update the data without
engineering support.
User Defined Types (Compound Files) 


Real world cases often involve more than one file. For example, a driver’s
license has a front and a back. We can think of creating a new user defined
type of “Driver’s License” and then having it support two child files, each
being an Image. Or we can consider a “Rich Text” conversation that has
multiple text files, images, etc.


Defining DataMaps 


A DataMap handles the loading and unloading definitions between
applications. For example, loading data to a model training system or a
What to Label analyzer. Having these definitions well defined allows
smooth integrations by end users, and removes the need for engineering
level changes. In other words, it decouples when an Application is called
from the data definition itself. Examples include declaring a mapping
between spatial locations formatted as x_min, y_min, x_max, y_max and
top_left, bottom_right, or mapping integer results from a model back to a
Schema.
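
The two mappings just mentioned can be sketched in a few lines. This is only an illustration of the idea: the function names, the class ID table, and the assumption that the image origin is the top-left corner are mine, not part of any specific tool's DataMap format.

```python
def corners_to_points(box):
    """Map {x_min, y_min, x_max, y_max} to {top_left, bottom_right}.

    Assumes image coordinates with the origin at the top-left corner.
    """
    return {
        "top_left": (box["x_min"], box["y_min"]),
        "bottom_right": (box["x_max"], box["y_max"]),
    }


def points_to_corners(box):
    """Inverse mapping, back to the x_min / y_min / x_max / y_max form."""
    (x_min, y_min), (x_max, y_max) = box["top_left"], box["bottom_right"]
    return {"x_min": x_min, "y_min": y_min, "x_max": x_max, "y_max": y_max}


# Hypothetical table mapping a model's integer outputs to Schema labels.
CLASS_ID_TO_LABEL = {0: "car", 1: "pedestrian"}


def map_prediction(prediction):
    """Translate one raw model prediction into Schema terms."""
    mapped = corners_to_points(prediction["box"])
    mapped["label"] = CLASS_ID_TO_LABEL[prediction["class_id"]]
    return mapped


raw = {"class_id": 1, "box": {"x_min": 10, "y_min": 20, "x_max": 50, "y_max": 80}}
print(map_prediction(raw))
```

Keeping these definitions in one declared place, rather than scattered through scripts, is what lets the mapping be swapped without engineering changes.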


Ingest Wizards 


A newer development is UI-based ingestion wizards. These started with file
browsers for cloud based systems and have progressed into full grown
mapping engines, similar to smart switch apps for phones, where I use an
app to move all my data from Android to iPhone or vice versa.

At a high level, a mapping engine steps you through the process of mapping
each field from one data source to another.

Mapping wizards offer tremendous value. They save having to do a more
technical integration. They typically provide more validations and checks to
ensure the data is what you expect (picture seeing an email preview in
Gmail before committing to open the email). And best of all, once the
mappings are set up, they can easily be swapped out from a list without any
context switching!


The impact of this is hard to overstate. Before, you may have been hesitant
to, say, try a new model architecture, commercial prediction service, etc.
because of the nuances of getting the data to and from it. This dramatically
relieves that pressure.

What are the limitations of wizards? First, some tools don’t support
them yet, so they may not be available to you. Second, a wizard may impose
technical limitations that are not present in more pure API calls or SDK integrations.


One of the biggest gaps in tooling is often around the question “How hard is
it to set up my data in the system and maintain it?”
Then come the questions of what type of media it can ingest and how quickly
it can ingest it.

This is a problem that’s somewhat distinct from other software. You know
when you get a link to a document and you load it for the first time? Or
some big document starts to load on your computer?

Recently, new technology like “Import Wizards” (step-by-step forms) has
come up that helps make some of the data import process easier. While I
fully expect these processes to continue to become even easier over time,
the more you know about the behind-the-scenes aspects, the more you
understand how these wonderful new wizards are actually working.


Bucket 
A technology that stores raw data such as images and video. Also called: 


Object Store 


Binary Large Object - BLOB 


Raw data such as images, video, text. Also known as: Object 


Organizing Data and Useful Storage 


One of the first challenges is often how to organize the data you have
already captured (or will capture). One reason this is more challenging than
it may at first appear is that these raw datasets are often stored remotely.

At the time of writing, cloud data storage browsers are generally less mature
than local file browsers. So even the most simple operations, for example
sitting at a screen and dragging files, can take on a new challenge.
Some practical prescriptions here:

• Try to get the data into your annotation tool sooner in the process rather than
later. For example, if I write the data reference to the annotation tool at
about the same time as new data comes in and is written to a generic
object store, I can “automatically” organize it to a degree, and/or more
smoothly enlist team members to help with organization level tasks.

• Consider using tools that help surface the “most interesting” data. This is
an emerging area, but it’s already clear that these methods, while not
without their challenges, have merit and appear to be getting better.

• Use Tags. As simple as it sounds, tagging datasets with business level
organizational information helps. E.g. the Dataset “Train Sensor 12” can
be tagged “Client ABC”. Tags can cross cut data science concerns and
allow both business control/organization and data science level
objectives.


Remote Storage 


Data is usually stored remotely relative to the end user. This is because of
the size of the data, security requirements, automation (e.g. connecting from
an integrated program, practicalities of running model inference,
aggregating data from nodes/systems), etc. In addition, on a team, the person who
administers the training data may not be the person who collected the data
(consider use cases in medical, military, on-site construction, etc.).

This is relevant even for solutions with no external internet connection, also
commonly referred to as “air gapped” secret-level solutions. In these
scenarios, it’s still likely the physical system that houses the data will be a
different box than the end user’s, even if they sit 2 feet from each other.

The implication of this is that we need to somehow access the data: at the very
least by whoever is annotating the data, and most likely also by some kind
of data prep process.


Versioning 


Versioning is important to reproducibility. That said, sometimes versioning
gets a little “too much” attention. In practice, for most use cases, being
mindful of the high level concepts, using snapshots, and having good definitions
will get you very far.

There are three primary levels of data versioning: Per Instance (Annotation),
Per File, and Per Export.

Their relation to each other is shown in Figure 4-5.


[Figure: a scale running from Low Level to High Level, showing Per Instance History, File Versions, and Export Versions in that order.]


Figure 4-5. Versioning High Level Comparison 


Here we introduce them at a high level. 
Per Instance History 


By default Instances are not hard deleted. When an edit is made to an
existing instance, Diffgram marks it as soft deleted and creates a new
instance that succeeds it, as shown in Figure 4-6. For example, use this for
deep dive annotation or model auditing. It is assumed that soft_deleted
instances are not returned when the default instance list is pulled for a file.
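
The soft delete pattern can be sketched with a small in-memory store. This is an illustration of the general idea under my own assumptions about field names (`soft_delete`, `previous_id`); it is not Diffgram's actual data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instance:
    id: int
    label: str
    box: dict
    soft_delete: bool = False
    previous_id: Optional[int] = None  # the instance this one succeeds


class InstanceStore:
    """Keeps every version of an instance; edits never overwrite in place."""

    def __init__(self):
        self._rows = []
        self._next_id = 1

    def create(self, label, box, previous_id=None):
        inst = Instance(self._next_id, label, dict(box), False, previous_id)
        self._next_id += 1
        self._rows.append(inst)
        return inst

    def edit(self, instance_id, **box_changes):
        old = next(r for r in self._rows
                   if r.id == instance_id and not r.soft_delete)
        old.soft_delete = True  # keep the old row for history
        return self.create(old.label, {**old.box, **box_changes}, old.id)

    def list_instances(self, include_history=False):
        return [r for r in self._rows
                if include_history or not r.soft_delete]


store = InstanceStore()
first = store.create("car", {"width": 100, "height": 50})
second = store.edit(first.id, height=60)
print(len(store.list_instances()))                      # 1
print(len(store.list_instances(include_history=True)))  # 2
```

The default listing returns only the current instance, while the history remains available for auditing through the `previous_id` chain.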


[Figure: screenshot of an Instance History panel listing Instance Created, Edited, and Deleted events with timestamps and user, beside an Instance Diff panel comparing x min, y min, width, and height values.]

Figure 4-6. Left: Per Instance History in UI. Right: A single differential comparison between the
same instance at different points in time.


Per File & Per Set 


Each set of Tasks may be set to automatically create copies per file at each
stage of the processing pipeline. This automatically maintains multiple file
level versions relevant to the task schema.

You may also programmatically and manually organize and copy data into
sets on demand. Filter data by tags, such as by a specific machine learning
run. Compare across files and sets to see the diff on what’s changed.

Add files to multiple sets for cases where you want the files to always be on
the latest version. That means you can construct multiple sets, with different
criteria, and instantly have the latest version as annotation happens.
Crucially this is a living version, so it’s easy to always be on the “latest”.

For example, use these building blocks to flexibly manage versions across
work in progress at the administrator level.


Per Export Snapshots 


Every export is automatically cached into a static file. This means you can 
take a snapshot at any moment, for any query, and have a repeatable way to 
access that exact set of data. Combine with webhooks, SDK, or Userscripts 
to automatically generate exports. Generate them on demand anytime. For 
example, use this to guarantee a model is accessing the exact same data. 


The header of an export is shown as an example in Figure 4-7.

Figure 4-7. Export UI Listview Example


Exporting and access pattern trade-offs will be covered in more detail in 


Data Access. 


Data Access 


So far we have covered overall architectural concepts, such as using a
Training Data Database to avoid a game of telephone, as well as
BLOBs and formats. Now we discuss the Annotations themselves. A
benefit of using a Training Data Centric approach is that you get best practices,
such as snapshots and streaming, built in.

Streaming is a separate concept from Querying, but the two go hand in hand in
practical applications. For example, you may run a query that results in
1/100 of the data of a file level export, and then stream that slice of the data
directly in code.

There are some major concepts to be aware of:

• File Based Exports
• Streaming
• Querying Data
Disambiguating Storage, Ingestion, Export, and Access 


Data storage 
Is the literal storage of the BLOBs and not the Annotations, which are 


assumed to be kept in a separate database. 


Ingestion 
Is about the throughput, architecture, formats, and mapping of data. 


Often this is between other applications and the Training Data system. 


Export 
In this context usually refers to a one time file based export from the 


Training Data system. 


Data Access 


Is about querying, viewing, downloading BLOBs and Annotations. 


One way to think of this is that often the way data is inserted into a database
is different from how it’s stored at rest, and from how it’s queried. Training data
is similar. There are processes to ingest data, different processes to store it,
and different ones again to query it.


Modern Training Data systems store Annotations in a Database (not a 


JSON dump), and provide abstract query capabilities on those Annotations. 


File Based Exports 


As mentioned in Versioning, a file based export is a moment in time
snapshot of the data. This is usually generated only on a very rough set of
criteria, e.g. a Dataset name. File based exports are fairly straightforward,
so I won’t spend much time on them. A comparison of the trade-offs of File
Based exports and Streaming is covered in the next section.


Streaming Data 


Classically, annotations were always exported into a static file, for example
a JSON file. With streaming, instead of generating an export as a one-off
thing, you stream the data directly into memory.

This addresses the question “How do I get my data out of the system?” All systems
offer some form of export. Is it a static one-time export? Is it direct to
TensorFlow or PyTorch memory?
Streaming Benefits 


• Load only what you need, which may be a small percent of a JSON file.
At scale it may be impractical to load all the data into a JSON file, so
this may be a major benefit.

• Works better for large teams. Avoids waiting for static files. You can
program and do work with the expected dataset before Annotation starts,
or even while Annotation is happening.

• More memory efficient. Because it’s streaming, you need not ever load
the entire dataset into memory. This is especially applicable for
distributed training, and when marshaling a JSON file would be
impractical on a local machine.

• Saves having “double mapping”, e.g. mapping to another format, which
will itself then be mapped to tensors. In some cases, parsing a JSON
file can take even more effort than just updating a few tensors.

• More flexibility: the format can be defined and redefined by end users.


Streaming Drawbacks 


• Defined in code. If the dataset changes, this may affect reproducibility unless
additional steps are taken.

• Requires a network connection.

• Some legacy training systems / AutoML providers may not support
loading directly from memory and may require static files.

Streaming is different from exporting in that it implies a direct connection.

One thing to keep top of mind throughout this is that we aren’t really
wanting to statically select folders and files. We are really setting up a
process in which we stream new data, in an event driven way. To do this,
we need to think of it more like assembling a pipeline than the mechanics
of getting a single existing known set.
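
To show the general shape of streaming, independent of any particular SDK, here is a minimal sketch that reads annotation records one at a time instead of loading a whole export. The line-delimited JSON format and the in-memory source are assumptions made so the example is self-contained; a real system would stream from the Training Data system over the network.

```python
import io
import json


def stream_records(lines):
    """Yield one annotation record at a time from a line-delimited source."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)


# Stand-in for a network response or a very large file.
source = io.StringIO(
    '{"file_id": 1, "label": "car"}\n'
    '{"file_id": 2, "label": "pedestrian"}\n'
    '{"file_id": 3, "label": "car"}\n'
)

# Only one record is held in memory at a time while filtering.
car_files = [r["file_id"] for r in stream_records(source) if r["label"] == "car"]
print(car_files)  # [1, 3]
```

Because `stream_records` is a generator, the consumer decides how much to read, which is what makes "load only what you need" possible.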


Example 


In this example we will fetch a dataset and stream it using the Diffgram
SDK.

!pip install diffgram

from diffgram import Project

# Public project example
coco_project = Project(project_string_id='coco-d…
default_dataset = coco_project.directory.get(name…

# Let's see how many images we got
print('Number of items in dataset: {}'.format(len(default_dataset)))

# Let's stream just the 8th element of the dataset
print('8th element: {}'.format(default_dataset[7]))
pytorch_ready_dataset = default_dataset.to_pytorch()


Queries Introduction 


Each application has its own query language. This language often has 
special structure specific to the context of Training Data. It also may have 


support for abstract integrations with other query constructs. 


To help frame this, let’s start with an easy example: getting all files with
more than 3 cars and at least 1 pedestrian.

dataset = project.dataset.get('my dataset')
sliced_dataset = dataset.slice('Labels.cars > 3 …
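
To make the meaning of that query explicit, here is the same condition written as a plain Python filter over in-memory records. This is not the SDK's query language; the record layout (`instances` with a `label` field) is an assumption made for illustration.

```python
from collections import Counter


def matches_query(file_record):
    """True if the file has more than 3 cars and at least 1 pedestrian."""
    counts = Counter(inst["label"] for inst in file_record["instances"])
    return counts["car"] > 3 and counts["pedestrian"] >= 1


files = [
    {"id": 1, "instances": [{"label": "car"}] * 4 + [{"label": "pedestrian"}]},
    {"id": 2, "instances": [{"label": "car"}] * 3 + [{"label": "pedestrian"}]},
    {"id": 3, "instances": [{"label": "car"}] * 5},
]

sliced = [f["id"] for f in files if matches_query(f)]
print(sliced)  # [1]
```

A query language inside the Training Data system expresses the same condition declaratively, so the filtering can happen in the database rather than after downloading every record.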


Integrations with Ecosystem 


To do model training and operations there are many applications. At the 
time of writing there are many hundreds of tools that fall in this category. 
As mentioned earlier you can set up a mapping of definitions for formats, 
triggers, dataset names and more in your Training Data tooling. We will 


cover these concepts in more depth in later chapters. 


Model Run Also known as: Predictions 


Running a machine learning model on a sample or dataset. For example, 
given a model X, inputting an image Y, and returning a prediction set Z. 
In a visual case this could be an object detector, an image of a roadway, 


and a set of bounding box type instances. 


Security 


Security of Training Data is of critical importance. Often raw data is
treated with greater scrutiny from a security view. For example, raw data of
critical infrastructure, driver’s licenses, military targets, etc.

Security is a broad topic that must be researched thoroughly and separately
from this book. Here I specifically call attention to the most common
security items in the context of Data Engineering. Please note that in other
sections we have discussed broader Security topics, such as security of
the overall Training Data system. The items covered here are:

• Access Control
• Signed URLs
• Personally Identifiable Information


Access Control 


Two of the big questions that can help us differentiate access control concerns are:

• What system is processing the data? What are the Identity and Access
Management (IAM) permission concerns around system level processing
and storage?

• What user access concerns are there?

Identity and Authorization

Production level systems will often use OpenID Connect (OIDC). This can
be coupled with Role Based Access Control (RBAC) and Attribute Based
Access Control (ABAC).


Those are all complex topics and general treatment of them is beyond the
scope of this book. Specific to Training Data, the Raw Data is often where
there is the most tension around access. In that context, access can usually be
addressed at either the per file or per set level. At the per file level, access
must be controlled by a policy engine that is aware of the triple of
{user, file, policy}. This can be complex to administer.
Usually it is easier to achieve this at the per set level. At the set (dataset)
level, it is achieved with {user, set, policy}.
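
As a minimal sketch of the {user, set, policy} idea, the class below stores grants as triples and denies anything not explicitly granted. The class and permission names are hypothetical and chosen for illustration; a production system would normally delegate this to its IAM or policy engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    user: str
    dataset: str
    permission: str


class SetLevelPolicy:
    """Deny by default; access exists only where a grant was recorded."""

    def __init__(self):
        self._grants = set()

    def allow(self, user, dataset, permission):
        self._grants.add(Grant(user, dataset, permission))

    def is_allowed(self, user, dataset, permission):
        return Grant(user, dataset, permission) in self._grants

    def visible_files(self, user, files):
        """Filter files down to those in datasets the user may view."""
        return [f for f in files
                if self.is_allowed(user, f["dataset"], "dataset_view")]


policy = SetLevelPolicy()
policy.allow("advisor@example.com", "restricted", "dataset_view")

files = [
    {"id": 1, "dataset": "restricted"},
    {"id": 2, "dataset": "other"},
]
print(policy.visible_files("advisor@example.com", files))
```

One grant covers every file in the set, which is why the set level is usually easier to administer than tracking a policy per file.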
One example 


In this code sample we will create a new dataset, a new security advisor 
role, and allow the {view, dataset} permission abstract object pair on that 


role. 


restricted_ds1 = project.directory.new(name='Hid¢ 
advisor_role = project.roles.new(name='security_ 
advisor_role.add_permission(perm='dataset_view', 




We then assign the user (member) to the restricted dataset. 


member_to_grant = project.get_member(email='secul 
advisor_role.assign_to_member_in_object (member_1ic 




Alternatively, this can be done with an external policy engine. 


Signed URLs 


Signed URLs are a technical mechanism to provide secure access to
resources, most often raw media BLOBs. A signed URL is the output of a
security process that involves Identity and Authorization steps. A signed
URL is most concretely thought of as a one-time password to a resource,
with the most common added caveat that it expires after a preset amount of
time. This expiry is sometimes as short as a few seconds, routinely under
one week, and rarely “virtually indefinite”, such as multiple years. Signed
URLs are not unique to Training Data, and you must research them well, as
they appear simple but contain many pitfalls. We touch on signed URLs
here only in the context of Training Data.


One of the most critical things to be aware of is that because signed URLs
are ephemeral, it is not a good idea to transmit signed URLs as a one-time
thing. Doing so would effectively cripple the Training Data system when
the URL expires. It is also less secure, since the time will be either too short
to be useful or too long to be secure. Instead, it is better to integrate with
your Identity and Auth system. This way signed URLs can be generated on
demand for specific {User, Object/Resource} pairs, and that specific user can
get a short-lived URL.
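
To make the mechanism less abstract, here is a simplified sketch of generating and checking a short-lived signed URL for a {User, Resource} pair using an HMAC. It is an illustration only and is not the signing scheme of any cloud provider; the secret, the parameter names, and the URL are all assumptions.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlparse

SECRET_KEY = b"demo-only-secret"  # assumption: real keys live in a secrets manager


def _signature(base_url, user_id, expires):
    message = f"{base_url}|{user_id}|{expires}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def sign_url(base_url, user_id, ttl_seconds, now=None):
    """Return a URL that grants `user_id` access until the expiry time."""
    issued = time.time() if now is None else now
    expires = int(issued) + ttl_seconds
    query = urlencode({
        "user": user_id,
        "expires": expires,
        "signature": _signature(base_url, user_id, expires),
    })
    return f"{base_url}?{query}"


def verify_url(url, now=None):
    """Check that the signature matches and the URL has not expired."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        user_id = params["user"][0]
        expires = int(params["expires"][0])
        signature = params["signature"][0]
    except (KeyError, ValueError):
        return False
    base_url = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
    expected = _signature(base_url, user_id, expires)
    current = time.time() if now is None else now
    return hmac.compare_digest(signature, expected) and current < expires


url = sign_url("https://example.com/blobs/123.jpg", "annotator-7", ttl_seconds=60)
print(verify_url(url))  # True while the URL is still within its 60 seconds
```

Because the URL is cheap to generate, the system can issue a fresh one each time a specific user requests a specific resource, rather than storing long-lived URLs.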


In other words, you can use a service outside the Training Data system to
generate the signed URLs, so long as the service is integrated directly with
the Training Data system. Again, it’s important to move as much of the
actual organizational logic and Definitions inside the Training Data system.
Single sign-on (SSO) and Identity and Access Management (IAM) integration
commonly cross cut Databases and Applications, so that’s a separate
consideration.


Training Data systems offer new ways to secure the data. This includes
transmitting Training Data directly to ML programs, which avoids giving a single
person unnecessary access. I encourage you to read the latest
documentation from your Training Data system provider to stay up to date
on the latest security best practices.


Cloud connections & Signed URLs 


Whoever is going to supervise the data needs to view it. This is the
minimum level of access and essentially unavoidable. Prep systems, such as
the systems that remove Personally Identifiable Information (PII), generate
thumbnails, pre-label, etc., also need to see it. Also, for practical system
to system communication, it’s often easier to transmit only a URL / file path
and then have the system directly download the data. This is especially true
because many end user systems have much slower upload rates than
download rates. For example, imagine saying “use the 48 GB video at this
path” (KBs of data) vs. trying to transmit 48 GB from your home machine.

There are many ways to achieve this, but signed URLs, a per resource
password system, are currently the commonly accepted method. They can
be “behind the scenes”, but generally end up being used in some
form.


For both good and bad reasons this can sometimes be an area of 
controversy. I'll highlight some trade-offs here to help you decide what’s 


relevant for your system. 


Signed URLs

A URL that contains both the location of a resource (like an image) and
an integrated password. Similar to a “share this link” in Google Docs.
Signed URLs may also contain other information and commonly time
decay, meaning the password expires. For example, a signed URL has
the general form of: sample.com/123/?password=secure_password

(Actual signed URLs are usually quite long, about the length of this
paragraph or more.)


The reason that we make a bit of a distinction here and get so specific about
IAM is because training data presents some unusual data processing
concerns:

• Humans see the “raw” data in ways that are uncommon in other systems.

• Admins usually need fairly sweeping permissions to work with data,
again in ways uncommon in classic systems.

• There are data size and processing concerns that, while perhaps not really
different from classic ones, are new, and there are fewer established norms
for what’s reasonable.


Personally Identifiable Information (PII)

Two of the most common ways to address PII are to remove it, or to create a
PII compliant handling chain.

PII Removal


The PII may be contained in metadata (such as DICOM). You may wish to 
either completely wipe this information, or retain only a single ID linking 


back to a separate database containing the required metadata. 
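
The "retain only a single ID" option can be sketched as follows. The field names in `PII_FIELDS` are DICOM-style attribute names used purely as an illustration and are not a complete de-identification list; the `link_id` key and the function name are my own assumptions.

```python
import uuid

# Illustrative only: a real de-identification list is far longer.
PII_FIELDS = {"PatientName", "PatientBirthDate", "PatientAddress"}


def strip_pii_metadata(metadata, pii_fields=PII_FIELDS):
    """Split metadata into a cleaned copy and a separately stored PII record.

    The cleaned copy keeps a single random link ID. The removed fields are
    returned keyed by that ID so they can be stored in a separate database.
    """
    link_id = str(uuid.uuid4())
    removed = {k: v for k, v in metadata.items() if k in pii_fields}
    cleaned = {k: v for k, v in metadata.items() if k not in pii_fields}
    cleaned["link_id"] = link_id
    return cleaned, {link_id: removed}


meta = {"PatientName": "Jane Doe", "PatientBirthDate": "19800101", "Modality": "CT"}
cleaned, pii_lookup = strip_pii_metadata(meta)
print(cleaned)  # contains Modality and link_id only
```

The random ID carries no information about the person, so the cleaned file can move through annotation while the lookup table stays under stricter access control.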


The PII may be contained in the data itself. For images, for example, this 


may involve blurring faces and identifying marks such as house numbers. 


This will vary dramatically based on your local laws and use case. 


PII Identification and Training 


The dataset contains PII and cannot or will not be changed. Perhaps the PII
is desired or useful for training. This requires having a PII compliant data
chain, PII training for staff, and generally appropriate tagging identifying
the elements as containing PII.


Pre-Label 


Supervising model predictions is common. It is used to measure system 
quality, improve the training data (annotation), and alert on errors. We will 
discuss the pros and cons of pre-labeling in later chapters. Here is a brief 
introduction to the technical specifics. The big idea with pre-label is that we 
take the output from an already run model and surface it to other processes, 


such as human review. 
Updating Data 


It may seem strange to start with the update case, but since records will 
often be updated by ML programs, it’s good to know update plans prior to 


running models and ML programs. 


If the data is already in the system, then you will need to refer to the file ID,
or some other form of identification such as the filename, to match with the
existing file. For large volumes of images, frequent updates, video, etc., it’s
much faster to update an existing known record than to re-import and
re-process the raw data.


It’s best to have the Definitions between ML programs and Training Data 


defined in the Training Data program. 


If that is not possible, then at least include the Training Data File ID along 
with the data the model trains on. Doing this will allow you to more easily 
update the File later with the new results. This ID is more reliable than a 


filename because filenames are often only unique within a directory. 


Pre-Label Gotchas 


Formats, for example video sequences, can be a little difficult to wrap your
head around. This is especially true if you have a complex Schema. For
these, I suggest making sure the process works with an image, and/or the
process works with a single default sequence, before trying true multiple
sequences. SDK functions can help with pre-label efforts.


Some systems use relative coordinates and some use absolute. It is easy to
transform between these as long as the height and width of the image are
known. A transformation to relative coordinates is defined as
x / width and y / height. For example, a point (x, y) of (120, 90) in an
image with a (width, height) of (1280, 720) would have relative values of
120/1280 and 90/720, or (0.09375, 0.125).
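
The transformation can be sketched in a few lines of Python. The function
names are illustrative only, and the values are the ones from the example
above.

```python
def to_relative(x, y, width, height):
    """Convert absolute pixel coordinates to relative (0 to 1) coordinates."""
    return x / width, y / height

def to_absolute(rel_x, rel_y, width, height):
    """Convert relative coordinates back to absolute pixel coordinates."""
    return rel_x * width, rel_y * height

rel = to_relative(120, 90, 1280, 720)   # (0.09375, 0.125)
back = to_absolute(*rel, 1280, 720)     # (120.0, 90.0)
```

Going back to absolute coordinates needs the same width and height, which is
why it helps to store the image dimensions alongside the instances.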


If this is the first time you are importing the raw data, then it’s possible to 
attach the existing instances (annotations) at the same time as the raw data. 


If it’s not possible to attach the instances, then treat it as Updating. 


A common question is: “Should all machine predictions be sent to the
Training Data Database?” The answer is yes, as long as it is feasible, with
one exception: noise is noise, and there’s no point sending predictions
that are known to be noisy. Many prediction methods generate multiple
predictions with some threshold for inclusion. Generally, whatever
mechanism you have for filtering this data needs to be applied here too,
for example, taking only the highest “confidence” prediction. To the same
end, in some cases it can be very beneficial to include this “confidence”
value or other “entropy” values to help better filter training data.
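
As a rough sketch of that filtering step, assuming each prediction is a
dictionary carrying a confidence field (the field name and the 0.5
threshold are assumptions chosen for illustration):

```python
predictions = [
    {"name": "face", "type": "box", "confidence": 0.92},
    {"name": "face", "type": "box", "confidence": 0.41},
    {"name": "face", "type": "box", "confidence": 0.08},
]

def filter_predictions(predictions, min_confidence=0.5):
    """Keep only predictions at or above the confidence threshold."""
    return [p for p in predictions if p["confidence"] >= min_confidence]

def top_prediction(predictions):
    """Return the single highest-confidence prediction (list must be non-empty)."""
    return max(predictions, key=lambda p: p["confidence"])

kept = filter_predictions(predictions)
best = top_prediction(predictions)
```

Because the confidence value stays on each kept prediction, it can be
imported along with the instance and used for filtering later.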


Pre-Label data prep process 


Now that we have covered some of the abstract concepts, let’s dive into 
some specific examples for selected media formats. We cannot cover all 
possible formats and types in this book, so you must research the docs for 


your specific Training Data system, media types, and needs. 
As shown in Figure 4-8, a pre-label process involves the following steps:

1. Map your data to your annotation tool’s format.

2. Attach the data to the annotation tool’s files at import.

3. Verify the data.


Importing Instances Overview: 1. Map your Data to Diffgram format. ->
2. Attach Data to File at Import. -> 3. Verify Data.

Figure 4-8. Block diagram example


Usually there will be some high-level format, for example, that an image
may have many instances associated with it, or that a video may have many
frames and each frame may have many instances, as shown in Figure 4-9.


Figure 4-9. Visual overview of relation between raw media and instances


Example Python code:

import random

def mock_box(
        sequence_number: int = None,
        name: str = None):
    # Build one mock box instance with random coordinates.
    return {
        "name": name,
        "number": sequence_number,
        "type": "box",
        "x_max": random.randint(500, 800),
        "x_min": random.randint(400, 499),
        "y_max": random.randint(500, 800),
        "y_min": random.randint(400, 499)
    }

This is one “instance”. So, for example, running the function above could
return:

instance = {
    "name": "Example",
    "number": 0,
    "type": "box",
    "x_max": 500,
    "x_min": 400,
    "y_max": 500,
    "y_min": 400
}

We can combine instances into a list to represent multiple annotations:

instance = {}
instance_list = [instance, instance, instance]




Chapter 5. Annotation Automation 


A NOTE FOR EARLY RELEASE READERS 
With Early Release ebooks, you get books in their earliest form—the 
author’s raw and unedited content as they write—so you can take advantage 


of these technologies long before the official release of these titles. 


This will be the 5th chapter of the final book. Please note that the GitHub 


repo will be made active later on. 


If you have comments about how we might improve the content and/or 
examples in this book, or if you notice missing material within this chapter, 


please reach out to the editor at jleonard@oreilly.com. 


Introduction 


Too much work? Is annotation tedious? Quality poor? 
In this chapter I show you how to solve these problems with automations. 


The first topic is Pre-Labeling - the idea of running a model before 
annotation. I’ll cover the caveats you should be aware of and step through 


extensions of the idea such as ‘micro-model’ labeling. 


Next, Interactive Automations are when a user adds information in order to 
help the algorithm. For example drawing a box to automatically get a 
tighter location marked by a polygon. Interactive improvements are often 
most relevant to spatial locations - however that’s just the start. The end 
goal of Interactive Automations is to make tedious UI work a more natural 


extension of human thought. 


Quality Assurance (QA) is one of the common uses of training data tools. I 
cover exciting new methods like using the model to debug the ground truth. 
Other tools automatically check base cases and look at the data for general 


reasonableness. 


Pre-Labeling, Interactive Automations, and QA tools will get you far. After
covering the foundations, I’ll walk through key aspects of data exploration
and discovery. What if you could query the data and only label the most
relevant parts? This area includes concepts like filtering an unknown
dataset down to a manageable size, and more.


There are many domain specific options that can be implemented as well.
For instance, if you have multiple sensors, you can get a “two for one” deal
by labeling one sensor and using known geometry to estimate into the other
sensors. Incorporating interpolation and object tracking can speed up video
annotation. Using dictionaries and heuristics can help with text. I’ll walk
through all of these at a high level so you are aware of them.


I will touch on data augmentation, common ways it’s used, and cautions to 
be aware of. When we augment data, we derive new data based on the 
existing base information. From that viewpoint it’s easier to think of the 
base information as the core training data and then the deriving process - 
augmentation - as a machine learning optimization. So while a portion of 
the responsibility here is outside the scope of training data we must be 


aware of it. 


Simulation and Synthetic data have situation specific uses but we must be 


upfront about the performance limitations. 


Overall, there are many high quality approaches that will substantially 
automate your training data process. There are also limitations and risks to 
consider when implementing automation. I will provide a high-level 
overview of the most common methods you can actually use today, what 
results to expect, the trade-offs involved, and how they work together. For 
the most popular methods I will take a deeper dive into some of the 


specifics and for the rest I will provide a general introduction. 


We have a lot to unpack and experiment with in this chapter. Let’s get 
started by taking a closer look at the project planning process and 


techniques that are commonly used today. 


Getting Started 


The motivation to use automation comes from the problems encountered when
working with Training Data: high labor costs, a lack of available people,
tedious work, or even cases where it’s nearly impossible to get enough raw
data.


Some automations are more practical than others. I outline what people
actually use for most projects, followed by what kind of results you can
and cannot expect. This is rounded out with the two most common confusions
about automations: fully automatic labeling and proprietary methods.


I wrap this section by looking at the costs and risks. This section is a view 
of how the concepts map together, and ultimately to how they actually help 
your work. It can also help direct your reading and act as a reference to 


quickly look up common solution paths. 


Motivation: When to use these methods? 


Table 5-1 lists the typical problems that lead people to seek out each
automation method.


Table 5-1. Reference of Problem and corresponding Automation Solution

Problems                                          Solutions
Too much routine work                             Pre-label
Annotations too costly                            Pre-label
Subject matter expert labor cost too high         Pre-label
Data annotated has a low value add                Pre-label, Data Discovery
Spatial annotation is tedious                     Interactive Automations, Pre-label
Annotator day to day work is tedious              Interactive Automations, Pre-label
Annotation quality is poor                        Quality assurance tools, Pre-label*
Obvious repetition in annotation work             Pre-label, Data Discovery
It is near impossible to get enough of the        Simulation & Synthetic
original raw data
Raw data volume clearly exceeds any reasonable    Data Discovery
ability to manually look at it

* To avoid tedious work and reduce annotator fatigue


Note: Check if the method is designed to work with Schema or Location 


Some automations work on both the schema (label, attributes) and spatial 
location (box, token position). Some work on only one or the other. For 
example, object tracking is generally oriented towards spatial and doesn’t 
usually help with meaning. When it matters I will highlight this difference, 


otherwise it usually comes down to the implementation. 


As an example of this impact, a method that offers a 2x improvement on the 
spatial location will be of little importance if 90% of the time is in the 
Schema - defining the meaning of what is actually present. This is relevant 


to all automations that impact annotations directly. 
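
The arithmetic behind this example can be checked with a short Python
sketch. The function name is illustrative, and the 90%/10% split is simply
the figure used in the example above, not a measured value.

```python
def overall_speedup(fraction_affected, local_speedup):
    """Overall speedup when only part of the total work gets faster."""
    remaining = (1 - fraction_affected) + fraction_affected / local_speedup
    return 1 / remaining

# 10% of annotation time is spatial work, and it becomes 2x faster.
spatial_only = overall_speedup(fraction_affected=0.10, local_speedup=2)

# For contrast: the same 2x gain applied to the Schema work (90% of the time).
schema_only = overall_speedup(fraction_affected=0.90, local_speedup=2)

print(f"{spatial_only:.2f}x vs {schema_only:.2f}x")
```

Under these assumptions the 2x spatial improvement yields only about a 5%
overall gain, which is why it matters where the time is actually spent.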


What do people actually use? 
With so many new concepts and options it can be easy to feel overwhelmed. 


While these methods are always changing, here I highlight the top methods
that I have seen people use successfully and that are the most broadly
applicable across projects. The two questions I will most try to answer:


1. What are the best practices? 
2. What do you need to know to be effective and successful with training 


data automations? 


Most projects can usually use these techniques:

1. Pre-labeling
2. Interactive automations
3. Quality assurance tools
4. Data discovery
5. Augmentation of existing real data

These approaches may be commonly used; however, that does not mean they
are easy or always applicable. You must still have sufficient expertise to
use the approaches effectively - expertise you will gain in this chapter.


Domain Specific 


These techniques depend on the data and sensor configuration. They can 


also be more costly, or require more assumptions about the data. 
1. Special purpose automations, like Geometry and Multi-sensor approaches
2. Media specific, like Video Tracking and Interpolation
3. Simulation & Synthetic


A note on Ordering 


In theory there is some order to these methods; in practice, however, they
have very little universal order of usage. You can pre-label as part of a
data-discovery step, during annotation, or much later to grade production
predictions. Sometimes data discovery is only really relevant after some
significant percent of the data has been labeled. Therefore, I have simply
put the more common methods towards the start of the chapter.


What kind of results can I expect? 


Because this is a new area, it’s common for there to be few expectations
about what these methods can do. Sometimes people invent or guess at the
expected results. Table 5-2 unpacks some of the expected outcomes of using
these methods:

1. First is the method itself.
2. Next, “Does” is the expected result when well implemented.
3. Finally, “Does not” covers the most common issues and confusions.


Table 5-2. Expected results for each annotation automation technique

Method: Pre-Labeling
Does: Reduce the less meaningful work, usually within a single sample -
e.g. an image, text, or video file. Shifts the focus to correcting odd
cases. A building block for other methods.
Does not: Solve labeling entirely. A human must still review the data and
do further work with it.

Method: Interactive Automations
Does: Reduce tedious UI work. For example, drawing a box, and getting a
polygon drawn tight around an object in the box.
Does not: Completely eliminate all UI work.

Method: Quality Assurance Tools
Does: Reduce manual QA time on ground truth data. Discover novel insights
on models.
Does not: Replace human reviewers entirely.

Method: Data Discovery
Does: Makes human time focused on the most meaningful data. Avoids
unnecessary similar annotations.
Does not: Work well in cases where there is already a well functioning
model and the main goal is grading it.

Method: Augmentation
Does: Provide a small lift in model performance.
Does not: Work without having base data.

Method: Simulation & Synthetic
Does: Cover otherwise impossible cases. Provide a small lift in model
performance.
Does not: Work well for cases where raw data is already abundant or
situations are relatively common.

Method: Special purpose automations, like Geometry, Multi-sensor approaches
Does: Reduces routine spatial (drawing shapes) work by using geometry based
projections. Some limited methods can work largely independent of humans,
others usually require labeling of the first sensor and fill the rest.

Method: Data type specific, like Object Tracking (Video), Dictionary
(Text), AutoBorder (Image)
Does: Reduces routine work by utilizing known aspects of the media type.


Common Confusions 
Before we dive into this let’s address two of the most common confusions. 
Fully Automatic Labeling 


The myth: “We don’t need labeling, we made it automatic.” 


When an existing AI predicts something, that is a form of automatic
labeling. In this chapter, however, we are concerned with how to create
that AI model in the first place, which means some form of non-automatic
labeling must come first. If someone were ever to solve truly automatic
labeling of arbitrary data, they would have created the elusive Artificial
General Intelligence. While this may eventually happen, it is out of scope
for commercial concerns today.


Proprietary Automatic Methods 


Sometimes true, sometimes myth: “Use our method for 10x better results” 


The intent of this chapter is to convey the general methods and concepts
that are available. As you will see, because these methods stack with each
other, it is possible to get dramatically better results than you would by
purposefully going as manually and slowly as possible.


However, there are a vast number of concrete implementations. The single
most common theme I have seen with secret vendor-specific approaches is
that they are usually very narrow in scope - for example, one may work for
only one type of media, one distribution of data, one spatial type, etc. It
can be difficult to verify in advance whether your use case happens to meet
those assumptions. Statistically, it’s unlikely.


The best way around this is to:

1. Be aware of the general approaches as explained in this chapter.
2. Trend towards tools that make it easy to run the most up-to-date
research and your own approaches.


Risks 


All automations, even those operating as expected, introduce risk. Here are
a few that are specific to annotations.
Lack of Net Lift 


This is when automation doesn’t actually help.

This is surprisingly common, especially for automations that involve the
user interface.


Worse Results 


Automations can make results worse. 


For example the super-pixel approach can lead to blotchy annotations that 


are less accurate than a traced polygon. Don’t assume the failure state is 


equal to manual annotations. 


Cost Overruns 


As detailed in the next section, automations have many costs, including
time to implement, hardware cost, human training cost, etc.
Method Specific Risks
Every method is unique and comes with its own unique risks.


Even seemingly similar methods can have different effects, for example 
correcting a pre-labeled box is different from a pre-labeled polygon, and 


different again from an attribute classification. 


Treating it too Casually 


Sometimes these automations feel like just a little time saver, or some kind 
of obvious thing. We have to remember that every automation is like a mini 


system. That system can have errors and cause problems. 


Costs Expected 
Automations can sometimes feel like magic. 


Wouldn’t it be nice if the computer could just do x for us? This magic 


feeling often seems to shroud some of these methods in an almost religious 


zeal. 


My point with this section is to help you evaluate the total costs of the 
automations. This is especially important because all automations have 


costs and some are proposed with the most casual air. 


In cases outside of training data a single automation may have a big 


planning team, documentation, risk analysis, etc. 


Yet somehow here, there sometimes seems to be an expectation that these
automations will just work. It’s almost like we expect there to be magic,
and so we quickly lose our rational thinking cap and just keep paying,
since any price appears fine as long as it finally works its magic.
Setup Costs 


Virtually all automations require some form of setup. 


Even the ones with the most minimal of setup require a degree of training 


and understanding around their assumptions. 


Scope 


Imagine you have a lawn to mow. It’s a small front lawn about the size of a 
parking spot. You can get a push mower or a ride-on-mower. The ride-on, 
being about as big as the parking spot, is so effective it can pretty much just 


start up and instantly cover the entire lawn. 


Naturally, the ride-on mower is orders of magnitude more expensive than
the push mower, and it requires storage, supplies, gas, maintenance, etc.
So while once in place it will mow the lawn the fastest, its startup costs
and ongoing costs far outweigh the benefits.


Setting a good scope of automations is important. A few questions to 


consider: 


1. Does the automation require technical integration? Can it be done from a 
UI or through a wizard? 

2. What are the expected startup and maintenance costs? 

3. Is an off the shelf method adaptable to our needs? Do we need to write a 


specialized automation? 


If this sounds like a project management discussion that’s good - because 
that’s what each automation is - a project. Obviously a giant lawn mower is 
too large for a small lawn. So what are the right sizing metrics for training 


data? 


Compare to Best Practices not manual annotation 


Completing an entire project with 100% manual annotation is rare. By 
manual annotation, I mean no model review, no pre-labeling, no UI 
assistance features, none of the methods in this chapter. Yet automation 
methods are often presented as a percent savings relative to this mythical 


100% manual annotation. 


Instead of thinking of it as a comparison to manual annotation, think of it as 
comparing to best practices. And best practice is to use these automation 


methods appropriately. 


Benchmark questions: 


1. Are we reasonably aware of all commonly available methods? Of the
ones relevant to our case, what percent of them are we using?

For example, if there are four methods available and you are using all
four, then at the strategic level you are already there.

2. By survey, how unique do annotators feel the work they are doing is? Do
they feel they are making small adjustments or adding foundational
value with every annotation?

This can still be quantitative - but the point is to get away from lower
level metrics that are easily gamed or noisy.

3. What is the ROI of our automation methods?

For costs, consider all supplies (e.g. hardware), vendor costs, and
administration and data science costs. For the benefit, if possible, try
the actual next best alternative. This requires being clear-eyed about the
comparison.


Account for Correction Time 


For example, if a spatial speed up method is being used, try just drawing
the polygon points directly by tracing the object. When you account for
correction time, which method was actually faster?


I have seen research papers that reference “clicks”. They say something 
like: We reduced the number of clicks from 12 to 1 therefore it’s 12 times 
better. But in reality, there is another tool that allows you to trace the 
outline, meaning that there are no clicks, and it’s just as fast to trace the 


outline as it is to fiddle with the errors the automation makes. 


Table 5-3. Example comparison of time to draw vs time to correct

Tool                     Time to Draw    Time to Correct
Trace Outline            17 seconds      N/A
Magic Tool Automation    3 seconds       23 seconds

As you can see in this example, the magic tool takes only 3 seconds to
draw, but it then takes 23 seconds to correct, for a total of 26 seconds
against 17 seconds for tracing. So while it appears to be faster, it is in
fact slower.
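
The comparison in Table 5-3 comes down to adding the two columns before
comparing tools. A small Python sketch using the table's numbers (the
dictionary layout is just one convenient way to hold them):

```python
# Timings in seconds, taken from Table 5-3. None means no correction step.
tools = {
    "Trace Outline": {"draw": 17, "correct": None},
    "Magic Tool Automation": {"draw": 3, "correct": 23},
}

def total_time(timing):
    """Total time per annotation: drawing plus any correction."""
    return timing["draw"] + (timing["correct"] or 0)

totals = {name: total_time(timing) for name, timing in tools.items()}
fastest = min(totals, key=totals.get)
```

Measured this way, tracing totals 17 seconds and the magic tool totals 26
seconds, so the tool with the faster draw time is not the faster tool.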


Subject Matter Expert Required 


It’s also worth understanding that virtually all of these methods still end
up requiring a similar level of subject matter expertise. An analogy is
that collectively the tooling represents something akin to a word
processor. It saves you having to buy ink, mail letters, etc. But it
doesn’t write the document for you. It automatically formats the characters
you instruct it to write.


The challenge with automation methods here is that the startup costs and
ongoing costs of setting up these automations are unknown or hard to
predict.


The good news is that many of these improvement methods stack. The bad 
news is that most of the methods make a variety of assumptions, either 


about the data, your level of effort, etc. 


Sometimes people discussing these methods will speak to savings relative 


to 100% manual annotation. 


It’s worth saying right from the get go that 100% manual annotation is rare. 


Essentially, over time these automations appear to be shifting from being
thought of as “savings” to being thought of as “standard expectations”.


This is not meant to take away the magic from some of these methods. It’s 
quite incredible how quickly a dataset can be improved by careful 


combination of these methods. 


Even today, instead of thinking of these as savings, think of it more as 


“making what would otherwise be impossible, possible”. 


Let’s dive in! 


Pre-Labeling 


Pre-Labeling is usually focused on labels within a sample. For example 
bounding boxes or segmentation masks in an image. Pre-Labels can also be 


used for data discovery, see the data discovery section for more. 
TK: to reduce routine work 


Standard Pre-Labeling 


Let’s imagine we have a model that is sort of good at predicting faces. It 
usually gets good results, but sometimes fails. To improve that model we 
may wish to add more training data. The thing is - we already know that our 
model is good for most faces. We don’t really want to have to keep 


redrawing the “easy” examples. How can we solve this? 


Pre-Labeling to the rescue! Let’s unpack the diagram in Figure 5-1. First
some process runs, such as a model prediction. Humans then interact with
it. We then typically complete the “loop” by updating our model with the
new data.


Initially the user reviewing the data is shown all 5 predictions. A user may 
directly declare a sample to be “valid” - or the system may assume that any 


samples not edited are valid by default. However it’s marked, the net result 


is the second stage - 4 correct examples and 1 incorrect that needs to be 
edited. 


The general theme of this method is that over time it focuses the human 


user on the hard cases, the “net lift”. 
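
The review step can be sketched in Python as follows. This assumes the
“unedited means valid” convention described above; the prediction IDs, the
edit record, and the field names are all hypothetical.

```python
# Five model predictions, as in the walkthrough above.
predictions = [{"id": i, "label": "face"} for i in range(5)]

# Hypothetical record of human edits: prediction id -> corrected values.
human_edits = {3: {"label": "not_face"}}

def review(predictions, human_edits):
    """Treat unedited predictions as valid; apply edits to the rest."""
    reviewed = []
    for prediction in predictions:
        edit = human_edits.get(prediction["id"])
        reviewed.append({
            **prediction,
            **(edit or {}),
            "edited_by_human": edit is not None,
        })
    return reviewed

reviewed = review(predictions, human_edits)
accepted = [r for r in reviewed if not r["edited_by_human"]]
corrected = [r for r in reviewed if r["edited_by_human"]]
```

Keeping a flag for which samples were edited makes it easy to see later
where human effort actually went.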


Run a Process (run a model and predict labels or spatial locations) ->
Human Reviews and Corrects -> Update Process (update model with new data)

Figure 5-1. Pre-Label Overview
Benefits 


Why is pre-labeling so popular? There are a few benefits:


1. It’s fairly universal - it’s not tied to any specific method, and can work 
across many domains 

2. It’s relatively easy to implement and understand. 

3. It indirectly acts as a quality assurance control. People are looking at the 
actual predictions. This relationship to the actual model can be very 
valuable. 


4. It works on both meaning (Schema) and spatial location.


Caveats 


Requires an existing relevant model 


The primary caveat is that you need an “initial” model. In general,
pre-labeling is a “secondary” step in the process. This can be offset by
the fact that some methods and contexts require very small amounts of data
- for example, maybe even a handful of images and you can start - but it’s
still a blocker. Transfer learning does not solve this, but it does help
reduce how many samples are needed to start using the new model.


Can introduce new error modes


Because the reviewer is no longer looking at the data with “fresh eyes”, it
can be easy to miss something that “looks good” but is actually wrong.


Sometimes can be slower 


Editing may be slower than creating a new thing for some cases. If a model 
is predicting a complex spatial location and it requires lots of time to 


correct, it may be faster to simply draw it “right” from the get go. 


Requires some setup or reliance on “built in” methods 


While some tools offer built-in learning, in general it’s best to use our
existing model. What’s the point of “correcting” a model that’s not
actually going to be used? The downside is that wiring up the connections
(to import to the annotation tool) can take a bit of setup time.


Technical Setup Notes 


TK 


Micro Model Pre-Label 
Pre-Labeling with a different model. 


An extension of pre-labeling is to use a model to predict something it
knows, and use it as a starting point for predicting something unknown.
Essentially this divides the work: one responsibility (e.g. spatial) goes
to some existing model that we don’t plan to update, and our effort is
focused on adding new information.


The micro model gets discarded - it’s only used within the scope of 


annotation automation - saving time on the dimension it’s good at. 


Usually a micro model is not trained appropriately to be used as the actual 


primary model. 


For example, imagine we have a face detector that’s great at predicting
faces - but doesn’t know anything else. We can run that model first, get
the spatial locations of the faces, and then add the new information like
happy, sad, etc.


While it may seem subtle, it’s a big difference from standard pre-labeling. 
We aren’t specifically trying to correct the spatial locations, we are strictly 


using it as a time saver to get to labels and attributes. 
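
A minimal Python sketch of this division of responsibility. The
face_detector function is a hard-coded stand-in for an off-the-shelf
model, and the emotion field is a hypothetical attribute name.

```python
def face_detector(image):
    """Stand-in for an off-the-shelf detector; returns fixed boxes."""
    return [
        {"type": "box", "x_min": 10, "y_min": 20, "x_max": 110, "y_max": 140},
        {"type": "box", "x_min": 300, "y_min": 40, "x_max": 390, "y_max": 150},
    ]

def pre_label(image):
    """Create instances with spatial data filled in and the new label left open."""
    return [{**box, "name": "face", "emotion": None} for box in face_detector(image)]

def add_human_label(instance, emotion):
    """The human adds the new information; spatial fields are left untouched."""
    return {**instance, "emotion": emotion}

instances = pre_label(image=None)
labeled = [add_human_label(inst, emotion)
           for inst, emotion in zip(instances, ["happy", "sad"])]
```

The detector output is never corrected here; it only saves the time of
drawing boxes so the annotator can go straight to the attribute.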


Run a Model that's Good at Spatial Locations (model only knows location) ->
Human adds new information (Add New Label Information; Spatial Information
Predicted and kept unmodified) -> Model learns from new information

Figure 5-2. Caption TK


Imagine a sports broadcast of a basketball game - identifying and tracking
every player in each frame manually would be a lot of annotation work.
There are fairly good “people detectors”, tracking algorithms, etc. that
can run. These are responsible for the spatial location. Then the human
annotation can be focused on the relevant items - like what action the
players are doing - things that a generic tracker or detector doesn’t
provide.


Benefits 


Use off the shelf models 


Because our goal here is to add new information (not to correct an existing 
model) - we can use off the shelf models. This means that even if we don’t 
have any training data for our Happy/Sad model, we can get started with an 
existing face detector. This can mean faster setup time. Also because we 
aren’t directly concerned about improving the model, if built-in models are 


available, they can be used directly without drawback. 


Clear separation of concerns 


Either it helps speed up the process - or it doesn’t. 


Caveats 


1. An off the shelf model may not exist - or may require upgrading to be 
useful 
2. Usually not directly helpful for improving spatial detections 


3. Usually not very interactive - fairly static. 


The “One step early trick” 


Speaking to this idea more generally, it’s the concept of taking something
and getting a prediction “one step early”. For example:

1. A generic object detector that doesn’t know the class.

2. A generic “event” detector that doesn’t know the class.

Typically this is combined with some other method, e.g. pre-labeling.


Quality Assurance Pre-Labeling 


While technically not an automation I must mention one of the most 
common use cases for training data - quality assurance. Essentially this 
means loading existing predictions, often from production models, into a 


system for correction. 


What’s the difference between model Quality Assurance and 
Pre-Labeling? 


For Industry type use cases, the main difference is just the intent behind
it. There is a similar technical process to load the data, and the actual
correction mechanisms are usually fairly similar in the UI.

With Quality Assurance the intent is generally still to improve future
models, and this is again similar to pre-labeling.


How to get started Pre-Labeling 


Generally the cycle looks like:

1. Identify your existing models
2. Map the data from the model to the training data tooling
3. Create tasks using the existing data
4. Annotate
5. Export and use the data to retrain new models.
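
The mapping step in this cycle is usually the part that needs code. Below
is a hedged Python sketch. It assumes a model that emits relative-coordinate
boxes with a label and score, and a tool that expects box instances with
name, type, and absolute x/y min and max keys; check your own model and
tool documentation for the real formats.

```python
def map_detection_to_instance(detection, image_width, image_height):
    """Convert one relative-coordinate detection into an absolute box instance."""
    rel_x_min, rel_y_min, rel_x_max, rel_y_max = detection["box"]
    return {
        "name": detection["label"],
        "type": "box",
        "x_min": round(rel_x_min * image_width),
        "y_min": round(rel_y_min * image_height),
        "x_max": round(rel_x_max * image_width),
        "y_max": round(rel_y_max * image_height),
    }

# Hypothetical model output for one 1280x720 image.
detections = [{"label": "face", "box": (0.25, 0.5, 0.5, 0.75), "score": 0.9}]
instance_list = [map_detection_to_instance(d, 1280, 720) for d in detections]
```

The resulting list is what would be attached to the file at import or sent
as an update to an existing file.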


Interactive Annotation Automation 


Interactive automations reduce otherwise tedious UI work - e.g. where it’s
clear to the human what is correct, and the goal is just to get that data
understood by the system faster.


Introduction 


Interactive automations are when a user interacts and adds information in
order to help the algorithm. In general they can make otherwise tedious UI
work a more natural extension of human thought.

If I as the user provide the interaction first, then we can get
“interactive” speed up approaches.

Consider wanting to get polygon - semantic segmentation - data for
complex shapes. Tracing every point by hand is tedious. Instead, what if I
could just tell the computer what area of the image I’m looking at, and it
would figure out the shape?


Another assumption is that the user is standing by, ready to correct. So
it’s not as if the user does something and then moves on; rather, the user
“reviews” the output and modifies it accordingly.


The key test compared to other methods is that it should be impossible for 
the method to run, without some kind of user input. Whereas say full image 
or full video methods can be run as a background computation without 


direct user input. 


The crucial difference here is that it’s usually not the model of interest
actually running. In fact many of these methods work without any machine
learning at all. Popular examples include:

1. Box to Polygon (GrabCut, DEXTR)
2. Tracking algorithms
3. Contrast, brush, etc.


Interaction (Input) -> Script Runs -> Instances (Output)

Figure 5-3. Caption TK
Pros 


For complex spatial locations this can be extremely effective. 


It usually does not require a trained model, or an existing model. 


Caveats 


Can sometimes be “reinventing the wheel”. E.g. if you have a model that’s
good at detecting people, it’s probably better to start with that than to
manually identify each person (even if it’s a single click).


Often requires more user training, patience etc. Sometimes these methods 


can take so long to run that they don’t really save much time. 


To really get these methods working well, it often requires some form of 


engineering work over and above wiring input/output. 


The big idea about creating your own 


Some tools provide integrated methods, or off the shelf approaches. For
those tools, you may not have access to modify how they work. In those
cases, the following examples will serve as a guide to the inner workings
behind the scenes.


Open Source Diffgram provides a compiler to enable you to write your own 
automations. This means you can use your own models, the latest open 


source models, adjust parameters to your desire, etc. 


Technical Setup Notes 


How to try this yourself 


For the scope of these examples I’m using open source Diffgram’s
Automation library. You can download and try it yourself at:
https://github.com/diffgram/diffgram. The specific reason I’m including
Diffgram here is that it includes a compiler that can make these custom
automations work.


The example code is all in JavaScript.

For the sake of brevity, most of the examples are pseudocode. Please see
the linked examples for full executable code.


Interactive on Drawing Warm up 
TK 


What is a Watcher? (Observer Pattern) 


Any time a user does something, such as creating an annotation, deleting an
annotation, or changing labels, we think of it as an event.
We do so at a semantically meaningful level for training data.


For example, a create_instance event is more meaningful than a
regular mouse_click if I want to do something, such as running a
model, after the user creates the instance.


How to use a Watcher: 


We must first define a function that will do something upon a 


create_instance event. 


create_instance: function (data){
  // your function here
  console.log(data[0])
}
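To make the observer idea concrete, here is a minimal sketch of how a watcher could route events to functions like the one above. The Watcher class, its on and emit methods, and the shape of the data are all hypothetical names for illustration; this is not Diffgram's actual implementation.

```javascript
// Hypothetical, minimal watcher registry for illustration only.
class Watcher {
  constructor() {
    this.handlers = {}; // event name -> list of callbacks
  }

  // Register a callback for a semantically meaningful event.
  on(eventName, handler) {
    if (!this.handlers[eventName]) {
      this.handlers[eventName] = [];
    }
    this.handlers[eventName].push(handler);
  }

  // Notify every callback registered for this event.
  emit(eventName, data) {
    const registered = this.handlers[eventName] || [];
    registered.forEach((handler) => handler(data));
  }
}

const watcher = new Watcher();
const loggedTypes = [];

// Only react to the training data event we care about.
watcher.on("create_instance", (data) => loggedTypes.push(data[0].type));

watcher.emit("create_instance", [{ type: "box", x_min: 10, y_min: 20 }]);
watcher.emit("mouse_click", { x: 5, y: 5 }); // nothing is watching this event

console.log(loggedTypes);
```

The point of the design is that the script author only declares interest in create_instance and never has to reason about raw mouse events.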


[Screenshot: the “watcher create instance example” script shown in the automation UI]


Figure 5-4. Example of Automation Interaction UI 


After we enable this script and the user draws an example annotation, we
see the example output in Figure 5-5.

Figure 5-5. Console: example of logged members of an annotation


Interactive Capturing of a Region of Interest 


Here is an example of code that captures the canvas region of interest based 
on a User’s annotation. While this may seem fairly simple, it’s the first step 


towards running an algorithm based on just the cropped area. 


Whole Canvas Region of Interest (ROI) 


Figure 5-6. Example of Whole Canvas Vs Region of Interest 


The code to capture this looks like this: 


create_instance: function (data){
  let ghost_canvas = diffgram.get_new_canvas()
  let instance = {...data[0]}
  let roi_canvas = diffgram.get_roi_canvas_from_…(
    instance,
    ghost_canvas)
}


More generally the idea here is to use some input from the user to pre- 


process what we feed to our model. 


Interactive Drawing Box to Polygon Using Grabcut 


Now, this example will take our region of interest canvas, run the standard
OpenCV GrabCut algorithm, and output polygon points, which will then be
converted into a human-editable annotation.


Crucially, you can imagine replacing grabCut() with a preferred algorithm 


of your choice. 


Simplified pseudocode: 


let src = cv.imread(roi_canvas);
cv.grabCut(src, /* args */)       // replace with your model
cv.findContours(/* args */)       // because it’s a dense r…
cv.approxPolyDP(/* args */)       // reduce to useful vol…
points_list = map_points_back_to_global_reference(/* args */)
diffgram.create_polygon(points_list)
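The step that is easiest to get wrong is mapping the points back. The algorithm runs on the cropped ROI, so its output coordinates are relative to the crop, not the whole canvas. Here is a minimal sketch of that step, assuming the ROI is described by its top-left corner and an optional scale factor. The function name, argument shapes, and field names are my own assumptions for illustration.

```javascript
// Hypothetical helper: convert polygon points from ROI-local coordinates
// back to whole-canvas coordinates.
// `roi` holds the top-left corner of the crop on the full canvas.
// `scale` is how much the crop was enlarged before running the algorithm.
function mapPointsBackToGlobalReference(points, roi, scale = 1) {
  return points.map((point) => ({
    x: roi.x_min + point.x / scale,
    y: roi.y_min + point.y / scale,
  }));
}

const roi = { x_min: 100, y_min: 50 };
const localPoints = [
  { x: 0, y: 0 },
  { x: 20, y: 0 },
  { x: 20, y: 10 },
];

const globalPoints = mapPointsBackToGlobalReference(localPoints, roi);
console.log(globalPoints);
```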


Full Example


The full example is available here.


Full Image Model Prediction Example


Zooming out from the directly interactive example, we can also run
file-level models. You can imagine having a series of these models available, with
the user perhaps choosing at a high level which one to run.


The high level idea is to get the information from the user interaction, the 


raw media, and then run the automation. 


An example in JavaScript could look something like this:


this.bodypix_model = await bodyPix.load()
let canvas = diffgram.get_new_canvas()
let metadata = diffgram.get_metadata()
let segmentation = await this.bodypix_model.segmentPe…
let points_list = get_points_from_segmentation(segmentation)
diffgram.create_polygon(points_list)


Example - Person Detection for Different Attribute


Now, going back to a simpler example, we can also use these
automations to simply run models.


Here, we used the BodyPix example in a Diffgram userscript. It ran on the
whole image and segmented all the people. We now select a person
to add our own attribute, “On Phone?”. This is an example of using a model
that’s not directly our training data goal to avoid having to “reinvent the
wheel” for parts that are well understood. You can imagine swapping this
model for other popular ones, or using your own models for the first pass
here.


Instances & Attributes
Figure 5-7. Caption TK


How to get started with Interactive 


TK 


Quality Assurance (QA) Automation 


Novel Training Data QA tools to debug ground truth data and reduce 


Quality Assurance costs. 


Using the Model to Debug The Humans 


The big idea here is that models can actually surface ground truth errors.
Sounds crazy? The way to think about it is that if a model
has seen many examples of truth, it can overcome small errors. The intuition is
similar to how we can sometimes spot errors in a quiz answer sheet:
even though the sheet says something is the answer, we think about it and feel
confident that it’s not.


In practice, one way to do this with images is to use Intersection over Union
(IoU). If a model prediction doesn’t have a nearby ground truth, then
either the model is wrong or the ground truth is. Applications can surface this
as a ranked list, identifying the biggest deltas between predictions and existing ground truth.
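A minimal sketch of the ranking idea follows. The box field names, the confidence field, and the scoring rule (confidence times one minus the best IoU) are assumptions I picked for illustration; a real tool may rank differently.

```javascript
// Intersection over Union for two axis-aligned boxes.
function iou(a, b) {
  const xMin = Math.max(a.x_min, b.x_min);
  const yMin = Math.max(a.y_min, b.y_min);
  const xMax = Math.min(a.x_max, b.x_max);
  const yMax = Math.min(a.y_max, b.y_max);
  const intersection = Math.max(0, xMax - xMin) * Math.max(0, yMax - yMin);
  const areaA = (a.x_max - a.x_min) * (a.y_max - a.y_min);
  const areaB = (b.x_max - b.x_min) * (b.y_max - b.y_min);
  const union = areaA + areaB - intersection;
  return union > 0 ? intersection / union : 0;
}

// Rank predictions by how strongly they disagree with the ground truth.
// A confident prediction with no overlapping ground truth goes to the top.
function rankDisagreements(predictions, groundTruth) {
  return predictions
    .map((pred) => {
      const bestIou = groundTruth.reduce(
        (best, gt) => Math.max(best, iou(pred, gt)),
        0
      );
      return { pred, bestIou, score: pred.confidence * (1 - bestIou) };
    })
    .sort((a, b) => b.score - a.score);
}

const groundTruth = [{ x_min: 0, y_min: 0, x_max: 10, y_max: 10 }];
const predictions = [
  { id: "A", confidence: 0.9, x_min: 0, y_min: 0, x_max: 10, y_max: 10 },
  { id: "B", confidence: 0.8, x_min: 50, y_min: 50, x_max: 60, y_max: 60 },
];

const ranked = rankDisagreements(predictions, groundTruth);
console.log(ranked.map((r) => r.pred.id)); // B first: no ground truth nearby
```

This only covers predictions that lack a ground truth. The reverse direction (ground truth with no nearby prediction) can be ranked the same way by swapping the two lists.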


Automated Checklist Example 


While this may sound simple, a checklist is an essential part of the QA
workflow.


This can literally be a checklist for human review or, where possible, it can be
programmed. Some tools may include some
of these checks by default, but often the best checks are set up more as “test cases”,
where you define parameters relevant to your specific data (like a
checksum).


1. Is the count of samples reasonable? 

2. Are the spatial coordinates reasonable? 

3. Are all attributes selected or approved by human? 
4. Are all instances approved by human? 


5. Are there any duplicate instances? 


6. (Video specific) Tracks where an instance is missing. For example, present in frame n,
missing in frame n+1, reappears in frame n+2, and not marked as end of
sequence.

7. Dependency type rules. For example, if class A is present, then class B should also
be present, but class B is missing. This is different from heuristics in that
it’s usually more about reviewing existing instances than generating them.
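A few of the checks in the list above can be programmed roughly as follows. This is a sketch under my own assumptions: the instance fields (label, x_min, and so on) and the function names are hypothetical, and each check is deliberately simple.

```javascript
// Check 5: instances with the same label and identical coordinates.
function findDuplicateInstances(instances) {
  const seen = new Set();
  const duplicates = [];
  for (const instance of instances) {
    const key = [
      instance.label,
      instance.x_min,
      instance.y_min,
      instance.x_max,
      instance.y_max,
    ].join("|");
    if (seen.has(key)) {
      duplicates.push(instance);
    } else {
      seen.add(key);
    }
  }
  return duplicates;
}

// Check 6: frames where a track is missing between its first and last frame.
function findTrackGaps(frameNumbers) {
  const present = new Set(frameNumbers);
  const first = Math.min(...frameNumbers);
  const last = Math.max(...frameNumbers);
  const gaps = [];
  for (let frame = first; frame <= last; frame++) {
    if (!present.has(frame)) gaps.push(frame);
  }
  return gaps;
}

// Check 7: if `ifPresent` exists, `thenRequired` must exist too.
function passesDependencyRule(instances, ifPresent, thenRequired) {
  const labels = new Set(instances.map((instance) => instance.label));
  return !labels.has(ifPresent) || labels.has(thenRequired);
}

const instances = [
  { label: "car", x_min: 0, y_min: 0, x_max: 10, y_max: 10 },
  { label: "car", x_min: 0, y_min: 0, x_max: 10, y_max: 10 },
  { label: "wheel", x_min: 1, y_min: 8, x_max: 3, y_max: 10 },
];

console.log(findDuplicateInstances(instances).length);
console.log(findTrackGaps([1, 3, 4]));
console.log(passesDependencyRule(instances, "car", "license_plate"));
```

Each function returns data rather than throwing, so the results can be shown to a reviewer as a list of things to look at.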


Checks based on looking at the data of samples 


1. Reasonableness check. For example, if the class is a “person”, and 95% 
of the pixels in the spatial location of the sample are all a similar color, 
that’s unlikely to be correct. Or if the majority are in a different 
histogram distribution than what’s expected (eg an object that is never 


red has red pixels) etc. 
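As a rough sketch of the first reasonableness check, the function below measures what fraction of the pixels in a crop fall into the single most common color bucket. The pixel format (an array of [r, g, b] triples), the bucket size, and the 95% threshold are assumptions for illustration; in practice you would read the pixels from the canvas region of the instance.

```javascript
// Fraction of pixels that fall into the most common coarse color bucket.
function dominantColorFraction(pixels, bucketSize = 32) {
  if (pixels.length === 0) return 0;
  const counts = new Map();
  let largest = 0;
  for (const [r, g, b] of pixels) {
    // Coarsen each channel so "similar" colors share a bucket.
    const key = [r, g, b]
      .map((channel) => Math.floor(channel / bucketSize))
      .join(",");
    const count = (counts.get(key) || 0) + 1;
    counts.set(key, count);
    if (count > largest) largest = count;
  }
  return largest / pixels.length;
}

// Flag a sample for human review when it is almost a single flat color.
function looksSuspicious(pixels, threshold = 0.95) {
  return dominantColorFraction(pixels) >= threshold;
}

// 95 near-identical red pixels and 5 black ones: probably not a person.
const flatCrop = [
  ...Array(95).fill([200, 30, 30]),
  ...Array(5).fill([0, 0, 0]),
];
// A crop with many different colors.
const variedCrop = Array.from({ length: 100 }, (_, i) => [
  i * 2,
  255 - i * 2,
  (i * 7) % 256,
]);

console.log(looksSuspicious(flatCrop), looksSuspicious(variedCrop));
```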


Data Discovery - What to Label Exploration 


to help identify what to label. 
1. Identifying outliers and filtering similar data without running our model 


The main difference between choosing based on data and choosing based on metadata is
that the former actually looks at the data itself. For example, this image is visually
similar to this other one. Choosing based on metadata does not need to look at the
actual image. For example, this was taken at night, and this was taken during the day.


Choosing Based on Data 
TK 


Human Review 


All methods involve some form of human review at some point. If you 
don’t choose any method you are likely in essence doing human review by 


default. 


Human review is often a good thing - even if it’s not automated. Some 
automations make fairly careless assumptions about the data that may not 
be relevant to your case. For example if you only have so many medical 


scans of patients, maybe you do want to use 100% of your data. 


While automation approaches for data discovery can be a very important
part of the process, I think of them as a more advanced topic; the
basics of understanding your data with human review come first.
Usually the clearest sign that you need data discovery is that your
raw data volume so clearly exceeds any reasonable ability to manually look
at it.


Pre-Label Based 


One commonly overlooked method is to use an existing network to pre-label
at the whole file level. For example, imagine the intent is to label only
daytime images, but for some reason the metadata is
unavailable. You can run a network to tag images as “nighttime” or
“daytime”, and then only label the daytime images.


Similarity Based 


In short, similar images are identified and grouped, so that each
batch contains similar images and you can sample from each batch.


And if you already know you are low on images of a certain type, you can
also try querying based on an example image of that type.
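Querying by an example image usually comes down to comparing embedding vectors. Here is a minimal sketch using cosine similarity. It assumes each image already has an embedding from some model; the tiny two-dimensional vectors and the function names are made up for illustration.

```javascript
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k items whose embeddings are most similar to the query.
function mostSimilar(queryEmbedding, items, k) {
  return items
    .map((item) => ({
      id: item.id,
      similarity: cosineSimilarity(queryEmbedding, item.embedding),
    }))
    .sort((x, y) => y.similarity - x.similarity)
    .slice(0, k);
}

// Hypothetical embeddings; real ones have hundreds of dimensions.
const images = [
  { id: "a", embedding: [0.9, 0.1] },
  { id: "b", embedding: [0.0, 1.0] },
  { id: "c", embedding: [1.0, 0.05] },
];

const neighbors = mostSimilar([1, 0], images, 2);
console.log(neighbors.map((n) => n.id));
```

A linear scan like this is fine for small sets. For large volumes you would likely swap in an approximate nearest neighbor index.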


[Figure panels: Example Image, Nearest Neighbor Plot 2, Nearest Neighbor Plot 3]


Figure 5-8. Example of similarity comparison. 


Choosing Based on MetaData 


For example, if you want to do something like this automotive company
does, you can query based on metadata
such as time of day, microscope resolution, position, etc.


Make sure to add the metadata during ingest; you can then query it on the
explorer tab.


This is more user directed, but can still dramatically reduce the amount of
data you need to label.
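To show the mechanics, here is a small sketch of filtering files by metadata. The query shape (plain values for equality, plus $gt and $lt for comparisons) and the field names time_of_day and distance_m are assumptions for illustration, not a real query language from any particular tool.

```javascript
// Does this file's metadata satisfy every condition in the query?
function matchesQuery(metadata, query) {
  return Object.entries(query).every(([key, expected]) => {
    const actual = metadata[key];
    if (expected !== null && typeof expected === "object") {
      // Comparison operators, for example { $gt: 1000 }.
      if ("$gt" in expected && !(actual > expected.$gt)) return false;
      if ("$lt" in expected && !(actual < expected.$lt)) return false;
      return true;
    }
    return actual === expected; // plain equality
  });
}

function queryFiles(files, query) {
  return files.filter((file) => matchesQuery(file.metadata, query));
}

// Metadata attached at ingest time (hypothetical fields).
const files = [
  { id: 1, metadata: { time_of_day: "night", distance_m: 1500 } },
  { id: 2, metadata: { time_of_day: "day", distance_m: 2000 } },
  { id: 3, metadata: { time_of_day: "night", distance_m: 200 } },
];

const toLabel = queryFiles(files, {
  time_of_day: "night",
  distance_m: { $gt: 1000 },
});
console.log(toLabel.map((file) => file.id));
```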


[Figure: screenshot of an example JSON query with name, requester, description, and query fields. The query combines metadata conditions using operators such as $and, $not, $eq, and $gt, with inline comments including “in drive”, “No right lane change”, “No left lane change”, and “No cutin”.]


Figure 5-9. Example of Query TK maybe replace Diffgram one to avoid license issue 


Simulation & Synthetic Data 


Synthetic data shows promise but is not a replacement for real data. In 
general - all synthetic data methods still require some form of real data to 
be useful. That’s why they are best thought of as an automation method on 


top of real data. 


Let’s get the big question out of the way real quick - does simulated data 
work? The super short answer is “Yes, but probably not as well as you may 


be hoping”. 


The next question is “Will it eventually work really well?”. This one is 
actually easier - it’s unlikely. And there’s a simple thought experiment to 


reflect on this. 


Simulating data at the level of realism required is a problem on the same order of
difficulty as Artificial General Intelligence. It’s relatively easy to get
photorealistic-looking renders, but there’s a big difference between those


simulated renders and real life.2 


Typically simulated data, if used well, can provide a small lift in 


performance. 


It can also be sometimes used for situations where it would be impossible, 


or very difficult, to get the data beforehand. 


Another way to think about this is that using a simulation to automatically
create training data is somewhat akin to using a heuristic. Instead of humans
supervising and saying what is right, we are back to trying to build a bottom-up
model, just one level removed.


Simulations are not perfect - Training Data still needs human review


Shown in Figure 5-10 is an example that a major company presented publicly.
They claimed it was “perfect” data. As you can see, there is a patch over the
crosswalk, part of what they touted as being more realistic, but then they
forgot to account for that in the training data generation, generating perfect
crosswalk lines over the imperfect pavement.


Were they trying to teach it that the black splotch was actually a
crosswalk? This doesn’t make sense. Of course, perhaps they meant to
post-process it later to project a perfect crosswalk. And perhaps some super
specialized network could theoretically do this. But at least from the
training data perspective this is just a plain error.


This is just a small example but it shows that assuming that a simulation 


will render perfect data is far from correct. 


Figure 5-10. Example of training data that was claimed to be perfect but has clear labeling mistakes. 


For the impossible and rare cases, I would really think of it less as 
“automatically creating training data” and more as “automatically creating 


scenes, for which humans can then create training data”. 


For example, a system may be able to automatically identify which pixels 
are from which object. But - if the simulation is sufficiently random, that 
may not be relevant for events. In other words, it’s important to think about 


what the simulation does actually know. 


A way to think about this is that if I have a simulation of shelves at a
supermarket, those shelves will look similar unless the simulation is
engineered specifically to render different shelves. But in what dimension
will they be different? Will those differences be relevant to the
production data? Who is programming that in? Remember that even subtle
shifts in the data can have huge effects on the result.


Clear Uses of Simulation: 


1. Product images. If you already know what the product looks like, that 
should be a clear enough case for a shopping type robot. 


2. Rare scenes of autonomous driving. 


Pros 


1. Simulations can improve performance 


2. Simulations can create otherwise impossible or rare cases 


Cons 


1. The improvement is usually only by a small amount. Think on the order 
of a relative 0-10%. There may be case specific exceptions to this so it’s 
worth some research. 

2. Often the fidelity of simulations is a lot further from the real world than 
may first appear. For example what looks visually good in a video, may 
be drastically different pixel for pixel. 


3. Simulations are often costly to set up, maintain, and operate.


Things to think critically about when considering simulations 


1. What are we actually simulating? Are we simulating lighting conditions? 


Camera angles? Whole scenes? 


Media Specific 


Many automation methods work well for all media types. Here I cover
some of the highlights of how popular methods intersect with specific
media types, and then cover some common domain-specific methods. This
does not cover all media types. The main intent of this section is to:


1. Give awareness to the relationship between specific media and 
automation types 


2. Provide an introduction to some of the domain specific types 


What methods work with which media? 


A ✓ indicates that the method will usually match. For example, using pre-labeling
with data discovery approaches is better defined for images than
for video. I have included more extensive footnotes here to expand on
the rationale where required and keep the chart easy to read.


As you can see, generally most of these methods can stack well together. Of 
course each method takes a degree of work and understanding. There is no 
requirement to use all of them. You can have a very successful project that 


completely skips a lot of these. 


Table 5-4. Media formats commonly used with each technique

Method / Data Type             Video                      Image       3D                         Text
-----------------------------  -------------------------  ----------  -------------------------  -------------
Pre-Labeling                   ✓                          ✓           ✓                          ✓
Interactive Automations        ✓*                         ✓           ✓                          Sometimes*
Quality Assurance Automations  Research Area              ✓           ✓                          ✓
Data Discovery                 Sometimes*                 ✓           Research Area              […]
Augmentation                   Sometimes*                 ✓           Research Area              Research Area
Simulation & Synthetic         ✓                          ✓           ✓*                         Research Area
Media Specific*                Varies - Check Method
Domain Specific                Few commonly used methods  Sometimes*  Few commonly used methods  […]

Domain Specific: special purpose automations, like Geometry and Multi-sensor
approaches. May take more effort to set up.

Notes:

* There are a variety of “real time” model training concepts, but often this is
  user focused.

* If the method requires conversion to images, this may prevent use of video spec[…]
  automations. If possible to think in terms of discovering segments, and ke[…]
  may improve compatibility.

* Many augmentation methods are focused on images, so may not be applic[…]
  detection or other types of motion prediction tasks.

* Because most simulations are 3D by default.

* Like Object Tracking (Video), Dictionary (Text), AutoBorder (Image). Bec[…]
  methods are unique to each media type, double check when choosing a sp[…]

* Many post-processing methods, but usually not in the context of training data.


Considerations 


Just because you are using a certain data type, eg Video, doesn’t mean that 


an automation will be available for that case. For example maybe you have 


no problem identifying people in videos, but identifying specific 


interactions may be harder. 


Video Specific 


If you are interested in event detection type actions, some of these
automations are not recommended. For example, object tracking could
confuse a situation where you want to know exactly which frame an event
occurs in.
Object Tracking 


The big idea here is to look at the data itself, and track an object throughout
multiple frames. There are a variety of tracking concepts; many need only a
single frame to provide an effective track forward through time.


Interpolation 


The idea here is that humans create keyframes and the in-between data is filled
in.


Keyframe - 1 Interpolated frames 2 through 9 Keyframe - 10 


Figure 5-11. Example of Interpolation 
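For boxes, the simplest way to fill in the in-between frames is linear interpolation between two keyframes. The sketch below assumes a box is stored as x_min, y_min, x_max, y_max plus a frame number; those field names and the interpolated flag are my own choices for illustration.

```javascript
// Linearly interpolate one box between two human-created keyframes.
function interpolateBox(startKeyframe, endKeyframe, frame) {
  const span = endKeyframe.frame - startKeyframe.frame;
  const t = (frame - startKeyframe.frame) / span; // 0 at start, 1 at end
  const lerp = (from, to) => from + (to - from) * t;
  return {
    frame,
    x_min: lerp(startKeyframe.x_min, endKeyframe.x_min),
    y_min: lerp(startKeyframe.y_min, endKeyframe.y_min),
    x_max: lerp(startKeyframe.x_max, endKeyframe.x_max),
    y_max: lerp(startKeyframe.y_max, endKeyframe.y_max),
    interpolated: true, // so reviewers can tell it apart from a keyframe
  };
}

// Fill every frame strictly between the two keyframes.
function interpolateBetween(startKeyframe, endKeyframe) {
  const filled = [];
  for (let f = startKeyframe.frame + 1; f < endKeyframe.frame; f++) {
    filled.push(interpolateBox(startKeyframe, endKeyframe, f));
  }
  return filled;
}

const keyframe1 = { frame: 1, x_min: 0, y_min: 0, x_max: 10, y_max: 10 };
const keyframe10 = { frame: 10, x_min: 90, y_min: 0, x_max: 100, y_max: 10 };

const inBetween = interpolateBetween(keyframe1, keyframe10);
console.log(inBetween.length); // frames 2 through 9
```

Linear interpolation assumes roughly constant motion between keyframes. When the object accelerates or turns, the human adds another keyframe.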


Polygon and Segmentation Specific 


TK 


AutoBordering 


I covered how to use AutoBordering in Chapter 3. Here I briefly want to
state that using techniques like that to snap to edges, as simple as it may
seem, is a great use of practical UI automations.


Language (NLP) Specific 
TK: 
Heuristics (covered earlier in this chapter) 


Dictionaries 


Augmentation 


Augmentation is modification of real data. For example skewing existing 


images, changing the brightness, introducing artifacts, etc. 


In general there seems to be a consensus towards doing augmentation at
runtime, and more minimally than the exuberance of early approaches. To
put it simply, augmentation is often a crutch, or provides a relatively small
lift. It can still be useful, and it’s good to be aware of when and how it can
be used.


Like simulated data, augmentations generally can provide some form of lift 


on the data. But think in terms of 0-10%. 


Better Models are Better than Better Augmentation 


One general theme is that as the models and algorithms get better, 
augmentations become less effective. This may not always be true but is a 


good rule of thumb. 


A simple way to think of this is that there are nearly infinite subtle
combinations of data values, be it text, pixels, etc. So trying to train on an ever
growing volume of subtly different data is a losing proposition compared to just having a
better understanding of what it is.


Think about it from the human perspective: I don’t need to see 100 different
variations of essentially the same thing in order to recognize it.


To Augment or Not To Augment 


The first thing to appreciate with any form of Augmentation is that it 


significantly alters the playing field. 


Next, the best evaluation of any augmentation method is what “net lift” it
provides. This is sometimes trickier to measure than is first apparent.


For example, imagine a system that pre-labels instances. The spatial
location is pre-labeled reliably. Expert users then add additional
information, such as what type of crop disease it is. The “net lift” in this
case is primarily time savings. So if the cost (literal cost, complexity, etc.)
to set up the pipeline, run the model, etc. is low enough, then every spatial
location that’s pre-labeled is a direct time savings. In practice, if a robust
model is available for this, or is already being run in production, then
this is valuable.


Alternatively consider some negative scenarios. 


An approach (for example, a superpixel-like method) is used to more “quickly” label spatial
locations. Except it introduces artifacts and hard-to-correct errors. It’s easy
for it to look “ok” at a glance, so these errors go unnoticed until significant
time is spent on the dataset. This may cause rework and/or hard-to-understand
production errors.


A pre-label approach that requires constant correction. Correction is ~3x 
slower than just drawing the spatial location correctly the first time. So, in 
the abstract, the pre-label must be very accurate to break even, and more so 


to provide net lift. 
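The break-even point can be worked out with simple arithmetic. The sketch below uses a cost model that is my own simplifying assumption: every pre-label costs some review time, and each wrong pre-label additionally costs a correction that is a multiple of the normal drawing time. Setting the pre-label cost equal to the drawing time and solving for accuracy gives the break-even accuracy.

```javascript
// Expected time per instance when using pre-labels.
// accuracy: fraction of pre-labels that need no correction (0 to 1)
// drawTime: time to draw a spatial location from scratch
// correctionMultiplier: how many times slower a correction is than drawing
// reviewTime: time to glance at and accept any pre-label
function prelabelTimePerInstance(accuracy, drawTime, correctionMultiplier, reviewTime) {
  return reviewTime + (1 - accuracy) * correctionMultiplier * drawTime;
}

// Accuracy at which pre-labeling costs exactly the same as drawing.
// Derived from: reviewTime + (1 - a) * multiplier * drawTime = drawTime
function breakEvenAccuracy(drawTime, correctionMultiplier, reviewTime) {
  return 1 - (drawTime - reviewTime) / (correctionMultiplier * drawTime);
}

// With a 3x correction cost and free review, about two thirds of
// pre-labels must be right just to break even.
console.log(breakEvenAccuracy(10, 3, 0));

// If reviewing each pre-label takes half as long as drawing,
// the bar rises further.
console.log(breakEvenAccuracy(10, 3, 5));
```

The numbers are illustrative. The useful part is measuring your own draw, review, and correction times and plugging them in.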


Any scenario in which an “external” model is used. It introduces bias into
the new training data - polluting it, so to speak. As a practical example,
imagine a person detector that appears to reliably detect people. It’s used as
the “spatial location” first pass, to which your custom labels are then
added. If that detector has a bias, perhaps favoring an ethnic majority, it
may be very difficult to spot, since it’s “hidden”. You may need an extra
QA step expressly to check for this potential.


At a high level, the general rules of thumb for these augment approaches 


are: 


Explore with caution - there are always trade offs and augment methods 


hide the trade offs well 


Consider the “net lift”, taking into account costs such as the added 


complexity, reduced flexibility, literal compute/storage costs etc 


Many of the approaches are very specific - it may work very well for one 


case, of one spatial type, etc. while failing on many others 


Other general observations 


The most effective options generally seem to favour spatial types over class 


labels 


Generally it’s best to create a “Seed” set, before exploring any type of 


‘model’ based approaches there 


Sometimes “Augmented” training data is somewhat hard to avoid, eg in the 


case of ‘correcting’ a production pipeline 


Runtime Augmentation 


It’s worth keeping in mind that in general, any type of augmentation must 
be reproducible. In that context, it’s usually more compute and storage 
effective to have the literal images created at training time, than stored as 


part of normal training data. 
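One way to keep runtime augmentation reproducible is to derive the augmentation parameters from a seed instead of storing the augmented images. Here is a minimal sketch. The seeded generator is written in the style of the small, widely shared mulberry32 routine; the parameter names, ranges, and the way the seed is mixed from file ID and epoch are all assumptions for illustration.

```javascript
// Small seeded pseudo-random generator (mulberry32 style).
// The same seed always yields the same sequence of numbers in [0, 1).
function seededRandom(seed) {
  let state = seed >>> 0;
  return function next() {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Derive augmentation parameters for one file in one epoch.
// Nothing is stored: re-running with the same inputs recreates them.
function augmentationParams(baseSeed, fileId, epoch) {
  const seed =
    (baseSeed ^ Math.imul(fileId, 0x9e3779b1) ^ Math.imul(epoch, 0x85ebca6b)) >>> 0;
  const random = seededRandom(seed);
  return {
    flipHorizontal: random() < 0.5,
    brightness: 0.8 + random() * 0.4, // multiplier between 0.8 and 1.2
    rotationDegrees: -10 + random() * 20, // between -10 and +10
  };
}

const firstRun = augmentationParams(42, 1001, 3);
const secondRun = augmentationParams(42, 1001, 3);
console.log(JSON.stringify(firstRun) === JSON.stringify(secondRun));
```

Only the base seed and the augmentation code version need to be recorded alongside the training run.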


Patch and Inject Method (Crop and Inject) 


The concept here is to take patches (meaning crops or subsections) of the
data and create novel combinations. For images, you can imagine placing
rare classes in a scene. This is sort of like a hybrid simulation where real
data is simulated in different places.


There is little research on this. In general - a model should be robust to 


these types of scenarios. 


Domain Specific 


TK 


Geometry Based Labeling 


In some cases, you can use known geometry of a scene, sensors, etc. to use 
geometry based transformations to automatically create some labels. This 


case is highly dependent on the specific context of your data. 


Multi-Sensor Labeling Automation - Spatial 


The main idea here is to use mathematical projections to infer where
something is in space based on where the sensors physically were. For
example, if you have 6 cameras, it can be possible to construct a virtual 3D
scene. Then you can label 1 camera, and project that into the other 5. Often
this means projecting into 3D and back into 2D.


Note this does not require a 3D oriented sensor like LIDAR or RADAR. 


But in general it’s considered easier if one of those sensors is present. 
Pros 


1. In the most optimal case, you get something like 5:1. Keep in mind you
still need to review and often correct the projections, so in reality it’s
more like 3:1 at its best.


Cons 


1. It requires having multiple sensors, extra metadata, the ability to project 


into 3D 
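The “back into 2D” half of the multi-sensor idea can be sketched with a basic pinhole camera model. This assumes you already have a 3D point for the labeled object and a calibration for each camera; the calibration fields (rotation, translation, fx, fy, cx, cy) and the values used are hypothetical, and real rigs also need lens distortion handling.

```javascript
// Project a 3D world point into one camera's image using a pinhole model.
// rotation (3x3) and translation (3) map world coordinates to camera coordinates.
function projectToCamera(pointWorld, camera) {
  const [x, y, z] = pointWorld;
  const { rotation: r, translation: t, fx, fy, cx, cy } = camera;
  const xc = r[0][0] * x + r[0][1] * y + r[0][2] * z + t[0];
  const yc = r[1][0] * x + r[1][1] * y + r[1][2] * z + t[1];
  const zc = r[2][0] * x + r[2][1] * y + r[2][2] * z + t[2];
  if (zc <= 0) return null; // point is behind this camera
  return { u: (fx * xc) / zc + cx, v: (fy * yc) / zc + cy };
}

const identity = [
  [1, 0, 0],
  [0, 1, 0],
  [0, 0, 1],
];

// Two hypothetical cameras facing the same way, mounted 2 meters apart.
const cameraA = { rotation: identity, translation: [0, 0, 0], fx: 1000, fy: 1000, cx: 640, cy: 360 };
const cameraB = { rotation: identity, translation: [-2, 0, 0], fx: 1000, fy: 1000, cx: 640, cy: 360 };

// A labeled point 10 meters in front of camera A.
const labeledPoint = [1, 0.5, 10];

console.log(projectToCamera(labeledPoint, cameraA));
console.log(projectToCamera(labeledPoint, cameraB));
```

The projected points are a starting suggestion for the other cameras. They still go to human review, which is why the practical ratio is lower than the theoretical one.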


Spatial 


This is similar to the multi-sensor labeling, but is more focused on
concepts that span large extents of geometry, like lane lines.


Heuristic Based Labeling 

In general heuristic labeling has been NLP focused. 
Dictionary based labeling 

User defined heuristics 
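As a concrete illustration of dictionary based labeling for text, the sketch below looks up each dictionary term in a sentence and emits a labeled span. The dictionary contents, the label names, and the span format are made up for illustration; matching is whole-word and case-insensitive by my own choice.

```javascript
// Escape characters that have special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Label every whole-word occurrence of each dictionary term.
function dictionaryLabel(text, dictionary) {
  const spans = [];
  for (const [term, label] of Object.entries(dictionary)) {
    const pattern = new RegExp(`\\b${escapeRegExp(term)}\\b`, "gi");
    for (const match of text.matchAll(pattern)) {
      spans.push({
        start: match.index,
        end: match.index + match[0].length,
        text: match[0],
        label,
      });
    }
  }
  return spans.sort((a, b) => a.start - b.start);
}

const dictionary = {
  aspirin: "DRUG",
  ibuprofen: "DRUG",
  patient: "PERSON",
};

const spans = dictionaryLabel(
  "Patient was given aspirin and later ibuprofen.",
  dictionary
);
console.log(spans);
```

The spans are pre-labels in the same sense as elsewhere in this chapter: a first pass for a human to review, not a finished annotation.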


This is a controversial area. The main problem is that the more well defined 
a heuristic is, the more it looks like we are coding and doing work there, 
and basically re-inventing feature engineering. Since one of the whole 
points of deep learning based approaches is automatic feature engineering, 


this essentially is self defeating. 


Now this is not to say heuristics based methods are without merit - far from
it. There is some incredibly interesting research in this area, especially for
text based applications.


If you have written a perfect set of heuristics, then why even train a 


machine learning model at all? 


For experienced folks, grading production predictions is also rolled into this 


concept 


Anecdotally, I have run this by both research and industry oriented people
and there’s usually common agreement on its truth.


Chapter 6. Tools 


A NOTE FOR EARLY RELEASE READERS 
With Early Release ebooks, you get books in their earliest form—the 
author’s raw and unedited content as they write—so you can take advantage 


of these technologies long before the official release of these titles. 


This will be the 6th chapter of the final book. Please note that the GitHub 


repo will be made active later on. 


If you have comments about how we might improve the content and/or 
examples in this book, or if you notice missing material within this chapter, 


please reach out to the editor at jleonard@oreilly.com. 


Introduction 


Choice and variety are abundant for training data tools. There are so many 
options that choosing the most appropriate one is often the hardest choice. 
Modern training data tooling varies from expansive platforms to special 
purpose tools. Here I’ll talk about the scope of some of these tools, from 


individual learning to large teams. 


Why do we have training data specific tools? These tools are driven by the 


burning needs of teams working with training data. For example, when I 


first came up with the earliest version of Diffgram, I wanted my
teammate who was working remotely to help annotate, and none of the
existing tools could do that - they all had to be installed locally for every
user. More recently, with an increasing number of tools to choose
from, teams have been frustrated by having to string together many thin
tools, and have had a great desire for a unified, all-in-one application.


What can these tools really do? When you buy a car you may not be sure 
exactly how it will drive, but you are pretty confident you will use it to get 
from point A to point B. What is A to B really in Training Data? What 
benefits do you get automatically from using training data tools? What do 
you have to work for? What does it look like when you are fully up and 


running? 


Reading this chapter you will walk away with a clear understanding of the 
general segments of tools available. From this high level map of the 
landscape I will dive into trade offs focused around different scopes of 
goals. Naturally it will be necessary to consider scale, queries per second, 
teams, and data types. I will explore the concerns top of mind when 
planning real world systems. Starting from single users, then small teams, 


and working up to large teams at major companies. 


Open Source or closed source? Self Install or Software as a Service? These 
arguments are as old as time. I will lend a perspective scoped to training 


data and the nuances around it. I will also cover training data customization, 


security, open source, deployment, costs, ease of use, installation, hardware, 


configuration, bias, myths, and metadata. 


At the end of this chapter you will walk away with a clear understanding of 
the most critical concerns, key trade-offs to consider, and a high level 


understanding of where these tools can take you. 


Why Training Data Tools 


We have databases to smoothly store data. Web servers to smoothly serve 


data. And now training data tools to smoothly work with training data. 


I say smoothly because I don’t have to use a database. I could write my data
to a file and read from that. Why do I need Postgres? Because Postgres
brings a vast variety of features, such as guarantees that my data won’t
easily get corrupted, that data is recoverable, and that data can be queried
efficiently. Training Data tools have evolved in a similar way.


Below I will offer a progression of thought that mirrors both what happened 
in the industry and will hopefully be relevant to what you are going through 


now. 


TOOL 

I will often use “tool” as the word even when it may be a larger system or
platform. By tool, I mean any technology that helps you accomplish your
training data goals. For example, while I consider Diffgram to be a platform,
it’s also a tool. When it matters I may differentiate; otherwise I tend to use
“tool” for simplicity’s sake.


Employee time in any form is often the greatest cost center. Well deployed
tooling brings many unique efficiency improvements, many of which can
stack to create many orders of magnitude improvement in performance. To
continue the database analogy, think of it as the difference between
sequential scans and indexes. One may never complete while the other is
fast! Training data tools upgrade you to the world of indexes.


What do Training Data Tools Do? 


From the strategic view, training data tools are a prerequisite to shipping your
machine learning systems. Zooming in slightly, training data tools
encompass:


• Interfaces to do literal annotation

• Tools to empower people, processes, and data in the context of human
  computer supervision

• Ways to bring clarity to and surface overall training data concerns


To better understand this in practice, let’s go back to our database analogy. 
Do I feel forced into using a database? While I may know I essentially have 
to use a database, I can focus more on the benefits to understand why I 
should be using one. “Wow, postgres allows me to store millions of records 


and query them in milliseconds!” 


Training Data tools similarly provide many benefits that go far beyond 
handling the minutiae. For example “Wow my data scientists can query the 
data trained by Annotators, saving having to download huge sets and 


manually filter locally”. 


Best practices and levels of competency 


Becoming highly competent with training data tools will take you at least as 


much work as it took to learn DevOps. 


Speaking for myself I am still learning DevOps years later ... so it’s worth 
considering that your training data learning may be more like a journey than 


a destination. 


I bring this up to level-set: no matter how familiar you are, or how
much time you spend, there is always more to learn about training data - I
am always learning myself!


Human Computer Supervision 


You may be familiar with the concept of “Human Computer Interaction”
(HCI). To me this means how I as a user relate to a program. Is it easy to
use? How do I interact with it?


With Training Data I would like to introduce a concept called Human 
Computer Supervision (HCS). This idea is that you are supervising the 
“computer”. The “computer” could be a machine learning model, or a larger 
system. The supervision happens on multiple levels, from literal annotation 


to approving datasets. 


This contrasts with Interaction, where the user is the “consumer”. Here, instead,
the user is the “producer”. The user is producing supervision which is
consumed by the computer. The key contrast is that computer interaction
is usually deterministic. If I update something I expect it to
be updated. Whereas computer supervision, like human
supervision, is non-deterministic. There’s a degree of randomness. As a
supervisor I can supply corrections, but for each new instance the computer
still makes its own predictions.


For the sake of space I won’t dwell on this distinction, and if it’s not clear 
right now don’t worry about it. Over time the general idea here is to keep 
differentiating between this new form of human computer supervision work 


and regular computer usage. 


Tools Bring Clarity 


Training Data tools are a means to effectively ship your Machine Learning 
product. As a means to this complex end, Training Data tools come with 
some of the most diverse opinions and assumptions of any area of modern 


software. Think of the NoSQL vs SQL era arguments, on steroids.


Tools help bring some standardization and clarity to the noise. They also 
help bring teams that may otherwise have no benchmark of comparison, 


rapidly into the light. 


• Why manually export files, when you can stream the data you need?

• Why have different teams storing the same data with slightly different
tags, when you can use a single unified data store?

• Why manually assign work when it can be done automatically?


Understanding the Importance of Tooling 


Not having the right training data tools would be like trying to build a car
without a factory. Achieving a fully set up training data system can only be
accomplished through the use of tools.


We have a tendency to take for granted the familiar. It’s just a car, or just a 
train, or just a plane. All engineering marvels in their own right. We 
similarly discount what we don’t understand. “Sales can’t be that hard” says 
the engineer. “If I were President” etc. I have found many people do not 


understand the breadth of training data tooling. 


One way to try to picture it: 


Training Data tooling is:
Photoshop + Word + Premiere Pro + Task Management + Big Data
Feature Store + Data Science Tools


That’s a bold statement. Let’s unpack it. 


• People expect to be able to annotate as well as the best drawing tools

• People expect modern task management like a dedicated task
management system

• They expect to be able to ingest and process huge amounts of data,
effectively an Extract Transform Load tool in its own right

• And on top of all that, to conduct analysis on it like a business
intelligence tool


All in one system. There are very few comparables to this! 


There are very few systems where 


• The literal work, and the management of the work, get done in the same
system

• The system spans many users and distinct roles

• The system does many distinct functions, but none of the functions
interact meaningfully besides reports


Whereas in a training data tool 


1. The literal work of annotation, and the task management are rolled into 
one system 

2. A suite of interfaces comparable to the Adobe suite or Office is rolled into
one system

3. The system’s many distinct functions must work in concert, with input
and output directly used in other systems, plus integrated data science
tools


Realizing the Need for Dedicated Tooling 


As an industry when we first started working with training data, the first 


rush was to just “get it done” to start training models. 


The questions were along the lines of “What was the most minimal bare 
bones UI for a human to jam annotations on top of data and then into a 
format a model could use?” This is when people first started to realize the 
power of modern machine learning methods and just wanted to see “will 


this work?” “Can it do that?” “Wow!”. 


The problems came swiftly. What happens when we move the project from 
research to staging, or even production? What happens when the annotator 


is not the same person writing the code, or even located in the same 


country? What happens when there are hundreds or even thousands of 


people annotating? 


At this point, people often start to realize they need some form of dedicated 
tooling. Early versions of training data tools answered some of these 
questions, allowing remote work, some degree of workflow, and scale. 
Quickly however more questions enter the picture as pressure on the system 


increases. 


More Usage, More Demands 


To put this very plainly, the moment you have a significant number of
people spending their full working day, eight hours a day every day, in a
tool, everyone’s expectations and the pressure increase.


Iterative model development, for example pre-labeling, creates pressure to
continually improve training data. While this is desirable, it also puts
increased pressure on the tooling: the more often automation approaches are
used, the more pressure. Static pre-labels are just the tip of the iceberg.
Some automations require interactions, further stressing the interactions
between data science, annotators, and annotation tooling.


Many features have been added to address these needs. As tooling providers 


added more features, the ability to have a smooth workflow became a new 


issue. Too many features, too many degrees of freedom. Now the 


responsibility to limit the degrees of freedom has increased. 


Advent of New Standards 


Tooling providers have now had some years of experience and learned
many things, from the creation of new named concepts specific to training
data to multitudinous implementation details. These off-the-shelf tools make
the overwhelming manageable. This enables you to use these
new standards and work at a level of abstraction that’s relevant to you and
your project.


Yes, we are at the early stages of training data standards. As a community we
are developing everything from conceptual ideas like Schema, to expected
annotation functions, to data formats. There is some agreement on what the
scope of training data tools is and what standard features are, but there is
still a way to go.


To understand the vast space involved in this, consider Fig 6-1. On the left
axis you can see the various types of media. There are about 9 major kinds.
On the other axis are the 9 major areas of interest, which you may be
familiar with from Chapter 4.


[Figure 6-1 column labels: Ingest, Store, Workflow, Annotation, Annotation Automation, Stream to Training Data, Explore, Debug, Secure & Private]


Figure 6-1. Landscape of Training Data Tooling 


While there is naturally some overlap, most of the functional areas have 
differences depending on the media type. For example, automations for 


text, 3D, and images, are all different. 


The realization here is that bespoke Rube Goldberg machines may answer
some of the complexities but fail to cover the vast space needed. The
progression of some tooling providers looks like Fig 6-1. Putting aside any
historical interest, as someone making a decision today, the context of the
progression helps ground where the value is coming from.


To make the most use of Fig 6-1, I like to think of it as the 30,000 foot
view. So if you are thinking about an automation improvement, it’s worth
reflecting on whether it will apply to all of the media types that are
relevant to you. It’s a reminder that any weakness in one area is likely to
create a bottleneck. If it’s difficult to get the data in and out, the value
of a great annotation workflow is diminished.
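One way to make the grid idea concrete is to record which combinations of media type and functional area a tool covers, then compute overall coverage and look for the weakest area. The sketch below is only illustrative: it uses a small subset of the grid, and the `coverage_percent` and `weakest_area` helpers and the example tool data are made up for this example.

```python
from typing import Dict, Set

# Hypothetical subset of the grid: a few media types by a few functional areas.
MEDIA_TYPES = ["image", "video", "text"]
AREAS = ["ingest", "annotation", "explore"]


def coverage_percent(covered: Dict[str, Set[str]]) -> float:
    """Percent of (media type, area) cells that a tool covers."""
    total = len(MEDIA_TYPES) * len(AREAS)
    hit = sum(
        1
        for media in MEDIA_TYPES
        for area in AREAS
        if area in covered.get(media, set())
    )
    return 100.0 * hit / total


def weakest_area(covered: Dict[str, Set[str]]) -> str:
    """Area supported by the fewest media types (first one wins ties)."""
    counts = {
        area: sum(1 for media in MEDIA_TYPES if area in covered.get(media, set()))
        for area in AREAS
    }
    return min(AREAS, key=lambda area: counts[area])


# Made-up example tool: strong on images, thin elsewhere.
example_tool = {
    "image": {"ingest", "annotation", "explore"},
    "video": {"ingest", "annotation"},
    "text": {"annotation"},
}

print(f"coverage: {coverage_percent(example_tool):.0f}%")
print(f"likely bottleneck: {weakest_area(example_tool)}")
```

In this made-up case the tool annotates every media type but can only explore images, so exploration is where a bottleneck would most likely show up.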


Journey to the Suite 


Where are you in your journey of needs? Do you already see the need for 
dedicated tools? For the best quality tools you can get? For a suite that 


covers the vast training data space? 


We all like familiar things. In the same way that office suites offer a similar 
set of expectations and experiences, from the UI to naming conventions, 
training data platforms aim to do the same. To create familiar experiences in 


multiple formats, be it text or images. 


Naturally at any given moment a single team may be focused on a specific 
data type or types (multimodal). The familiarity here helps well beyond 
this. New people joining the team can get up to speed more quickly, shared
resources can move between projects more easily, and more.

As shown in Fig 6-2, generally the progression goes from


1. Realizing the need for dedicated tooling 

2. Realizing the complexity of the technical space requires the best possible 
tooling - not just anything 

3. Realizing the complexity of the user space requires familiarity and 


shared understanding 


[Figure 6-2 labels: Bespoke; Relatively lightweight; Technical problem space; Thick, integrated, and customizable, addresses flexibility; Familiar experience across media types, addresses both technical problem space and user expectations; Spans a significant percent of the 9x9 grid]


Figure 6-2. Progression of training data providers 


As I will explain in more detail in Chapter 7, if you are considering having 
a director of training data position established, having familiar tooling is of 
critical importance to this team. The same annotator may easily shift 
between multiple types of media and projects. This also helps address the 


difference between data science concerns. 


To clear up a potential confusion: having a suite does not mean an “all
in one” solution for everything. Data science may have its own suite of
tools for training, serving, etc. It also does not preclude point solutions for
specific areas of interest. It’s more like an order of operations idea: we want
to start with the biggest operation, the main suite, and then supplement it
where required.


Open Source Standards 


By my estimate, in 2017 there were likely fewer than 100 people in the world
working on commercially available tools for training data. In 2022 there are
over 1,500 people across at least 40 companies working directly for training
data focused companies. Sadly, the vast majority of those
individuals work on separate projects in closed source software. Open
Source projects like Diffgram offer a bright future of shared access to
Training Data tools regardless of financial status or country of residence.


Open Source tooling also shatters illusions around what is magic and what 
is standard. Imagine spending more budget for a database vendor that 
promises 10x as fast queries, only to find out all they do is inject extra 
indexes. Now in some cases that could have value, but you would at least 
want to know upfront you were paying for ease of use, not the concept of
indexes! Similarly training data concepts like pre-labeling, interactive 
annotation, streaming workflows etc. are brought to the forefront. More on 


this later in the chapter. 


A paradigm to deliver machine learning software 


The same way the DevOps mindset gives you a paradigm to deliver 
software, Training Data mindsets give you a paradigm to ship Machine 


Learning software. To put it very plainly, these tools:


1. Provide basic functions to work with training data, like annotation and
datasets. Things that it would be otherwise impractical to do without tooling.

2. Provide guardrails to level set your project. Are you actually following a
Training Data mindset?

3. Provide the means to achieve training data goals like managing costs, tight
iterative time-to-data loops, and more.


A mental picture I like to think about is standing at the base of a large hill 
or mountain. From the base, I can’t see the next hill. And even from the top 
of that hill, my vantage is obscured by the next, such that I can’t see the 3rd 


mountain until I traverse the 2nd, as shown in Fig 6-3. 


For fans of video games, this is like the tech tree, where subsequent


discoveries are dependent on earlier ones. 


Figure 6-3. Sight-lines over Hills, I can only See The Next Mountain. (Sequentially Dependent 
Discoveries) 


Training Data tools help you smoothly traverse these mountains as you get
to them, and in some cases even “see around corners” and give you a
bird’s-eye view of the terrain. To make that concrete, I only came to
understand the need for querying data once I realized that, over time, the
organization approaches used to annotate data often don’t align with the
needs of a data scientist, especially on larger teams. This means that no
matter how good the initial dataset organization is, there is still a need to
go back afterwards and explore it.


Training data tools are likely to surprise you with unexpected opportunities 
to improve your process and ship a better product. They provide a baseline 
process. They help avoid thinking you have re-invented the wheel, only to 
realize an off the shelf system already does that and with a bit more finesse. 
They improve your business key performance indicators by helping you 


ship faster, with less risk, and with more efficiency. 


Now of course it’s not that these tools are a cure-all. Nor are they bug free. 
Like all software they have their hiccups. In many ways these are the early 


years of these tools. 


Scale 


Engineers love to talk about scale. For readers who aren’t in engineering, 
think of how differently Disney World operates from a local entertainment 
center like an arcade. What works for Disney doesn’t work for the arcade 


and vice versa. 


As I covered in the Automation chapter, at the extreme end of the scale, a fully
set up Training Data system allows you to retrain your models practically on
demand. Improving the speed of time-to-data (the time between when data
arrives and when a model is deployed) to approach zero can mean the
difference between tactical relevancy and worthlessness.
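As a minimal sketch of the time-to-data definition above, the metric is just the gap between two timestamps. The function name `time_to_data` and the example timestamps are assumptions for illustration; in practice the two events would come from your ingest and deployment systems.

```python
from datetime import datetime, timedelta


def time_to_data(data_arrived: datetime, model_deployed: datetime) -> timedelta:
    """Elapsed time between data arriving and a model being deployed."""
    if model_deployed < data_arrived:
        raise ValueError("model_deployed must not be earlier than data_arrived")
    return model_deployed - data_arrived


# Made-up timestamps for illustration.
arrived = datetime(2022, 3, 1, 9, 0)
deployed = datetime(2022, 3, 4, 15, 30)
elapsed = time_to_data(arrived, deployed)
print(f"time-to-data: {elapsed.total_seconds() / 3600:.1f} hours")
```

Tracking this number over successive retraining cycles is one simple way to see whether the system is actually moving toward on-demand retraining.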


Often the terms we are used to thinking about for scale with regular 
software aren’t as well defined with supervised Training Data. Here I take a 


moment to set some expectations with scale. 


Why is it useful to define scale? 


Well, first, to understand what stage you are in, which helps inform your
research directions. Second, to understand that various tools are built for
different levels of scale. A rough analogy may be SQLite vs Postgres. Two
different purposes, from the simple to the complex.


Speaking of complexity, a large data discovery tool may not be relevant if 
you are doing a small project and really plan to use 100% of your data 


anyway. 


Alternatively, for mid scale and up, you may prefer your team goes through 
a few hours of training to learn best practices of more complex tools if they 


are going to work on it day in and day out. 


So what makes this so hard? Well, for starters, many big companies keep the
deep technical details about AI projects pretty hush hush, to a degree
noticeably different from more common projects. Another reason is that in
the wild many public datasets don’t really reflect what’s relevant for
commercial projects, or are misleading; for example, they may have been
collected at a level of expense impractical for a regular commercial dataset.


Rules of Thumb 


Keep in mind that small projects may still have very real challenges.


Item & Scale Direction: Small / Medium / Large

Volume of Data at Rest (Annotations), in a sliding window period
    Small: Thousands
    Medium: Millions
    Large: Billions

Data types supported
    Small: Only needs to support what’s relevant to your use case
    Large: Most likely needs to support all

People Annotating (Subject Matter Experts)
    Small: A single person or small team
    Medium: A medium sized team
    Large: Multiple teams of people

People with data engineering, data science, etc. hats
    Small: A single person
    Medium: A team of people (data hats)
    Large: Multiple teams of people

Revenue impacted
    Small: No $ amount attached or pre-revenue
    Medium: Millions of dollars impacted by work
    Large: 100s of Millions or Billions of dollars impacted by work

System Load (Queries per second, QPS)
    Small: <1 QPS
    Medium: <30 QPS
    Large: 1000s of QPS

Chief concerns
    Small: Ease of getting started and ease of use, Cost of tooling (may be
    no or low budget for tooling)
    Medium: Effectiveness of tooling, Support and uptime of tools, Starting
    to think about optimization, May be planning a transition to major scale
    Large: Volume of data (“Scale”), Customization, Security, Inter-team
    issues, Assumes each team is already doing optimizations they are
    familiar with.


Of course there are many exceptions and nuances, but if you are trying to
scope out a project this may be a good starting point.
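The rules of thumb above can be turned into a rough scoping helper. This is a sketch under stated assumptions: the rules of thumb give only orders of magnitude (thousands, millions, billions of annotations) and QPS bands, so the exact cutoffs below, the function names, and the choice to report the larger of the two signals are all assumptions made for illustration.

```python
SCALES = ["small", "medium", "large"]


def scale_from_annotations(annotation_count: int) -> str:
    # Assumed cutoffs: below a million is "thousands", below a billion is "millions".
    if annotation_count < 1_000_000:
        return "small"
    if annotation_count < 1_000_000_000:
        return "medium"
    return "large"


def scale_from_qps(queries_per_second: float) -> str:
    # Bands taken from the rules of thumb: <1 QPS, <30 QPS, and beyond.
    if queries_per_second < 1:
        return "small"
    if queries_per_second < 30:
        return "medium"
    return "large"


def estimate_scale(annotation_count: int, queries_per_second: float) -> str:
    """Report the larger of the two signals (a design choice for this sketch)."""
    candidates = [
        scale_from_annotations(annotation_count),
        scale_from_qps(queries_per_second),
    ]
    return max(candidates, key=SCALES.index)


print(estimate_scale(annotation_count=5_000, queries_per_second=0.2))
print(estimate_scale(annotation_count=2_000_000, queries_per_second=0.5))
```

Other rows, such as team size or revenue impacted, could be added as further signals in the same way.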


Transitioning from small to medium scale 
Also applies to planning a mid scale system from scratch. 
A few things to think about 


1. Workflow 
2. Integrations 


3. Use of more data exploration tooling 


Also see the section on major scale for further directional planning. Not all 
of those concerns will apply or be actionable yet but it’s good to be aware 


of them. 


Build, Buy, or Customize 


There is a classic debate “build or buy”. 


I think this argument should really be Customize, Customize, or 
Customize? Because there is no valid reason to start from scratch at this 


point with so many great options already available as starting points. 


• Many options have increasing degrees of out of the box customization

• Open source options can be built on and extended


For example, maybe your data requirements mean the ingest or database of 
a certain tool isn’t enough. Or maybe you have a unique UI requirement. 


The real question is: 


• Should we do this ourselves?

• Should we get the vendor to do it for us?


Major Scale Thoughts 


Likely, if you are operating at a major scale there are already systems and 


teams in place. 
A few things to think about as you move forward 


1. What’s the velocity that the data moves through the system? How long 
does it take to go from new data, to upgraded supervised data, to a new 


model? This is similar to DevOps velocity.


2. Do we really need to build this in-house? Over the last few years the
commercial tooling market has changed dramatically. What was
completely unavailable 5 years ago may now be an off-the-shelf option.
It’s a great time to rethink what unique value each team is adding.
Can you more readily customize an ongoing project than do all the
plumbing yourself?

3. Do we really have to duplicate this data? Is there a more central way we 
can store this data as it moves around these stages? 

4. Are the concepts and awareness around classic training data
(discovery systems) relevant to this new form of human supervised
automation?

5. How many people does it take to discover and make a correction to a 
model? Human annotator, ML engineer, manager? 

6. Do you have a formal sign off process for datasets similar to sign offs for 
making code commits? What is the threshold for a human reviewer to 
make a “commit” to a dataset? You may already have model deployment 


flows, but start earlier to look at the actual data it’s being trained on. 


Flipping a few assumptions on major scale: 


1. What does the shape of the data really look like? If I need to provide a 
request/response cycle for every single image, audio file etc, then it may 
burst into 1000s of QPS. But does that really make sense? Can the data 


be queried (one cycle) and then streamed? 


2. Are my data governance policies actually getting implemented across 
teams? Are datasets stored with the same awareness of expiration 
controls as individual elements? Is there alignment between teams or 


configuration of tools for this? 
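The first point above, querying once and then streaming instead of paying a request/response cycle per file, can be sketched with a toy client that counts cycles. Everything here is hypothetical: `FAKE_STORE` stands in for a remote data store, and `CountingClient` with its methods is not the API of any real tool.

```python
from typing import Dict, Iterator

# Stand-in for a remote data store; in a real system this would be a service.
FAKE_STORE: Dict[int, dict] = {
    i: {"id": i, "media_type": "image" if i % 2 == 0 else "audio"}
    for i in range(10)
}


class CountingClient:
    """Hypothetical client that counts request/response cycles."""

    def __init__(self) -> None:
        self.request_count = 0

    def get_item(self, item_id: int) -> dict:
        # One request/response cycle per item.
        self.request_count += 1
        return FAKE_STORE[item_id]

    def query_and_stream(self, media_type: str) -> Iterator[dict]:
        # One query cycle, then matching items are yielded lazily.
        self.request_count += 1
        for item in FAKE_STORE.values():
            if item["media_type"] == media_type:
                yield item


# Approach 1: a separate cycle for every image.
per_item_client = CountingClient()
image_ids = [i for i, item in FAKE_STORE.items() if item["media_type"] == "image"]
per_item = [per_item_client.get_item(i) for i in image_ids]

# Approach 2: one query, then stream the results.
stream_client = CountingClient()
streamed = list(stream_client.query_and_stream("image"))

print(per_item_client.request_count, stream_client.request_count)
```

Both approaches return the same items; only the number of cycles differs, which is the quantity that turns into QPS at scale.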


Scope


As this ecosystem continues to evolve there is a broadening boundary 
between the scope of users and data the tooling is designed to work with. 
Some tools may cover multiple scopes. In general tools lean either towards 


single users or truly many users. 


As shown in Fig 6-4, one way to think of this is as a continuum with two
major poles: point and suite solutions.




Figure 6-4. Point Solution and Suite Continuum 


Note: Some of these icons are explained throughout the book. Naturally any
system will have some concept of input and output. So when we have an Ingest
icon, it’s to indicate something that an entire team would work on at a big
company. Further, icons like Secure refer to security products like blurring,
PII, etc., not the general concept of security.


Point Solutions 


Distinguishing features: 


1. Often mix Training Data and Data Science features. For example, it may 
be pitched as “End to End” or “Get a Model Trained Faster”. 

2. Focus on a single or a small handful of media types. 

3. For single users or small teams. This usage assumption cascades to 
features around who creates labels, ease of use etc. 


4. Software as a service or deployed locally on one machine 


Usage: 


1. Most appropriate usage includes experimenting with an end to end 
demo, or if it works well enough and you don’t have the resources to use 
other options. 

2. Usually, by their nature of being simpler, these tools are faster to set up
and get a “result”. Whether it will be the result you want is often more
questionable.

3. Having some form of automatic training built in. Automatic training is 
not an automatic negative, however, usually mid sized and larger teams 
want more control and so it must be taken with a grain of salt. 

4. Sometimes point solutions can be of great quality in their specific 


domain. 


Cautions: 


1. These tools usually severely limit - either by technical or intentional 


political limits - what type of results can be achieved. For example, they 


may have a method to train bounding boxes, but not keypoints. Or vice 
versa. This extends to media types too, they may have a method for 
images, but none for text. 

2. Usually are not appropriate for better resourced teams. May be lacking
many of the major feature areas, such as dedicated task workflow functions,
ingest, arbitrary store and query, etc.

3. Often are not very expandable or customizable relative to heavier weight 
solutions. 

4. Security and privacy are usually limited. Specifically for example the 
terms of service may allow these companies to use the data you create to 
train other models, sometimes projects are public by default if not 
paying, etc. Ultimately there must be trust in the service provider with 
your data. 

5. While the quality may be high, the need to string together the point 
solution with other tools often creates extra work. This is especially 
prevalent in a larger firm where the tool may be appropriate for one team 


but not another. 


Cost Considerations 


1. These types of tools often have a “long tail” of costs. They may have a 
cost per annotation. Or it may be free to train the model, but a cost to 


serve it (and no option to download it). 


Tools in between 


Generally most tools trend towards one of the extremes, either the smaller 
as already mentioned or the larger use cases as I will cover in a moment. 
There are also some tools that sit somewhere in between those poles.


Generally the progression to look for is: 


1. More awareness of training data as a separate, stand-alone concept. 

2. More awareness of multiple solution paths. Less “one true path” and 
more flexibility. 

3. Greater percent of landscape coverage. For example, may have more 
workflow functions. Human task management concepts. 

4. More enterprise friendly concepts. May offer local installation, or 
customer controlled installations. More focus on customization and 
function over golden path mentalities. 

5. There may be some contractual guarantees around data added. 

6. These tools may be able to provide serious results and be appropriate if 
your team has outgrown smaller tools but doesn’t yet have resources for 


larger tooling. 


A suite is not automatically better. However, it is usually hard for smaller 
tools to “step up” to a higher level whereas most larger systems can often be 


used only in part and they fit this middle path quite well. 


Platforms and Suites 


For mid-sized, large teams and companies with multiple teams 


From a very high view point the main differences in psychology with these 


systems: 
#1 View Training Data as a dedicated discipline. 


Even if they have other integrated Data Science products, services, etc., they
draw a clear line about what’s training data and what’s not.


#2 Offer a suite of media types and lateral supports. 


Usually you can tell it’s a broader system because it will cover more - or
even all - of the media types. Similarly for lateral supports like storing, 
streaming to training, exploring, etc. there will likely be more coverage. 
Given the vastness of the space I use the word coverage since even the most 


advanced and largest platforms have gaps. 


#3 Breadth and depth


Further expanding on #2, some solutions may offer great coverage on media
types, but only to a relatively superficial depth. As a solution leans more
towards this end of the spectrum, its depth of offerings in each category
continues to grow.


Customization 


The big product difference here is that often these tools assume they will be 
customized, either with more built-in customization options through 
configuration, or with more hooks and endpoints to naturally customize it 


through code. 


Generally, tools designed for large teams and scale 


1. Customization. Virtually everything is up for user configuration, from 
how the annotation interface looks, to how workflow is structured, etc. 

2. Installation. It’s assumed that installations will be done by, or at least
overseen by, the customer. Who has the encryption keys, where the data
is stored at rest, etc. are part of the discussion. Expect a dedicated and
clear security discussion.

3. Performance expectations and capacity planning are done. Any software,
no matter how scalable, still requires more hardware to scale.

4. Many users, teams, data types, etc. are expected.

5. Don’t offer integrated training. Typically this is because the quality of
integrated training is below expectations, or because a dedicated
team is already responsible for training.


Cautions: 


1. These systems can be very complex and powerful. They usually take 


more time to set up, understand, and optimize for your use case. 


2. Sometimes, head to head in a specific function, the larger system may not
fare as well. One reason for this is that what may be a high
priority for a point solution to fix may be a much lower priority in the
scope of a larger system.

3. Large systems, even with potentially stronger quality controls, have
more bugs. What may be hard to break in a small system due to its
simpler nature may break in a larger system due to the complexity.


Where is the Machine Learning? 


The best platforms offer a solution in between the two extremes of “brittle 


single autoML” and “do nothing”. 


Essentially this means focusing on the Human Computer Supervision side. 
How to get data to and from machine learning concepts. How to run your 
own models, integrate with other systems such as AutoML, dedicated 


training and hosting systems, resource scheduling etc. 


Tooling quickstart 


Because Diffgram is open source and fully featured, it’s the ideal training
data platform to start with. You can download Diffgram from
diffgram.com/install.


For other choices see trainingdatabook.com/tools. 


More broadly, for getting started here’s a cheat sheet for tooling: 


#1 Choose an open source tool to get up and running 
quickly. 


Some tools install in dev in minutes and a moderate production setup in 
hours or a few days. Most have optional commercial licenses that can be 
purchased. There’s no downside to this: you can easily upgrade the license
afterwards if you need to. This is faster than talking to a sales team and
gives a truer account than limited SaaS trials.


#2 Try multiple, choose only one 


One training data tool and one data science training tool. There are endless
optimizations and it’s easy to get stuck on early optimization without even
having the baseline up. The main danger of choosing two is that it’s too
easy to over-fit to the perceived ease of setup or first impressions. When
much of the value is delivered over a long period of time, this can lead to
missed opportunities.


#3 Use UI based wizards as much as possible. 
Even if you are an elite coder it’s just less mental overhead. 


Set up the tool locally to save on any effort with remote services and costs.
It’s easy to transition later to a full-fledged deployment.


Training Data Tooling Hidden Assumptions 


Training Data tools bring many benefits and are of critical importance. To 
reap the benefits however you still need to consider these assumptions. 


Some of these are usually True, others usually False. 


Before we cover the details on regular considerations it’s worth being aware 


of these assumptions. 


True: Meet the Team 


Admins, annotators, engineering etc. This is a product that gets touched by 
many people in the organization, often with very different goals, concerns 


and priorities. 


True: You have someone technical on your team 


Someone needs to install, set up, and maintain the system. Even for 100% 
service based tools with the latest wizards there’s still an assumption that at 


least one person is technical and one person understands training data. 


True: You have an ongoing project 


Most of this tooling requires a little bit of setup. This setup cost may be
marginal for an ongoing project.


True: You have a budget 


As we explore in the costs section, even for open source tools there are 


hardware and setup costs. 


True: You have time 


The complexity of some of this tooling is quite astounding. As of 2022 


open source Diffgram has over 1,200 files and 500,000 lines of code. 


False: You must use Graphics Processing Units (GPUs)


Training a model often benefits from having a processing accelerator like a
GPU. However, actually using the model in automations does not require a GPU.
Also, training in the context of a limited dataset does not benefit as much
from GPU power because it’s a smaller set.


False: You must use automations 


Automations are very useful - but they are not required. 


False: It’s all about the annotation UI 


An annotation editor is still an annotation editor regardless of which brand 
it’s from. Of course quality varies. Be careful not to over represent the 


annotation UI in your comparison process. Like when buying anything, any 


single feature matters to some extent but must be considered in context of 


the whole. 


Security 


According to a 2022 Linux Foundation report, “Security is the #1 priority that
influences what software an organization will use. License compliance is
the #2 priority.”4


Security Architecture 


For high security installations it is usually best to host your own training 
data tools. This allows you complete control to set your own security 
practices. You can control the encryption access keys and location of all 
aspects of the system from network to data at rest. And of course you can 


then set your own custom security practices. 


Attack Surface 


Installation is the starting point since networking is 101 for cyber security. 
The attack surface of an inaccessible network is low. So for example if you 
already have a hardened cluster you can install your software and use it 
within that network. For example, Diffgram can be installed using Helm or
Docker. Each Diffgram install has its own cloud bucket and database. This is


where all media and annotations are stored. Depending on your 


requirements, you may have one installation for all of your projects, or you 


can have a separate installation per project if required. 


Data Access 


Each Diffgram installation has two main data access points 


• The cloud bucket that everything gets deposited into

• The database


Inside the application, Diffgram allows users to add and configure cloud
data access based on credentials the user adds. Separate from this, at
installation time, a single cloud bucket is defined where all media
ingested from other sources is stored. You can configure this as you wish,
and also use it as a further control mechanism, since changing access here
will invalidate all raw storage access.


Human Access 


Diffgram has a role-based access control concept. This is set up on a “per
project” basis. There is a super admin / root role that bypasses the majority
of access concepts. There is no way to create a super admin user unless you
already have root level access to the system, such as the administrator who
set up the system. This super admin concept can also be disabled.
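To make the per-project idea concrete, here is a minimal sketch of a role check. This is not Diffgram’s actual implementation: the role names, permission names, and the `is_allowed` function are all hypothetical, chosen only to illustrate per-project roles plus a super admin bypass that can be switched off.

```python
from dataclasses import dataclass, field

# Hypothetical roles and permissions, for illustration only.
ROLE_PERMISSIONS = {
    "annotator": {"view_task", "submit_annotation"},
    "admin": {"view_task", "submit_annotation", "manage_members", "export_data"},
}


@dataclass
class User:
    name: str
    # Maps project id -> role name. No entry means no access to that project.
    project_roles: dict = field(default_factory=dict)
    is_super_admin: bool = False


def is_allowed(user, project_id, permission, super_admin_enabled=True):
    """Return True if `user` may perform `permission` inside `project_id`."""
    # The super admin role bypasses per-project checks, unless it is disabled.
    if super_admin_enabled and user.is_super_admin:
        return True
    role = user.project_roles.get(project_id)
    if role is None:
        return False
    return permission in ROLE_PERMISSIONS.get(role, set())


alice = User("alice", {"project_a": "annotator"})
root = User("root", is_super_admin=True)

print(is_allowed(alice, "project_a", "view_task"))   # True
print(is_allowed(alice, "project_b", "view_task"))   # False: no role in project_b
print(is_allowed(root, "project_b", "export_data"))  # True: super admin bypass
print(is_allowed(root, "project_b", "export_data",
                 super_admin_enabled=False))         # False: bypass disabled
```

The point of the sketch is that access is resolved per project first, and the super admin bypass is a separate, explicit switch.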


Identity Access Management (IAM) bucket delegation schemes


Bucket-based IAM delegation is a bad idea. Here’s a small sampling of why:


1. It is security obfuscation - which means it is not real security. Since the 
application has IAM access to the data, it can at any point access the data 
and store it somewhere else. Therefore in a security event, revoking 
access is only partly effective. 

2. What about network security? What about the annotations (the 
database)? Bucket IAM solves only a small fraction of the problem. 

3. Most IAM schemes generate signed URLs, which are difficult or
impossible to invalidate after the fact. Further, the act of invalidating
them may involve days of work and breaking changes such as moving all
files to a different directory. Bucket IAM schemes don’t even solve the
problem they were supposed to solve.

4. Data access is only one threat vector. All applications have 
vulnerabilities and all changes can introduce vulnerabilities. An attacker 
may not intend to exfiltrate data. They may instead wish to alter your 
training data. Or deny you access. 

5. It is well established that open source software is more secure over the
long run than most closed source software.
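To see why signed URLs are hard to invalidate (point 3 above), consider this simplified sketch. It is a generic HMAC scheme I am assuming for illustration, not any specific cloud provider’s signing algorithm; the keys, paths, and function names are hypothetical. The signature is checked using only the key and the expiry time, so the server keeps no per-URL state it could use to revoke a single leaked link.

```python
import hashlib
import hmac


def make_signature(path, expires_at, key):
    """Sign a path and expiry time with a secret key (simplified scheme)."""
    message = f"{path}:{expires_at}".encode()
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def is_valid(path, expires_at, signature, key, now):
    """A URL is valid if it has not expired and the signature matches."""
    expected = make_signature(path, expires_at, key)
    return now < expires_at and hmac.compare_digest(signature, expected)


KEY_V1 = b"hypothetical-signing-key-v1"
KEY_V2 = b"hypothetical-signing-key-v2"
NOW = 1_000
EXPIRES = NOW + 3_600

LEAKED_PATH = "/bucket/scan_001.png"
OTHER_PATH = "/bucket/scan_002.png"
leaked_sig = make_signature(LEAKED_PATH, EXPIRES, KEY_V1)
other_sig = make_signature(OTHER_PATH, EXPIRES, KEY_V1)

# Until it expires, the leaked URL stays valid: there is nothing to revoke.
print(is_valid(LEAKED_PATH, EXPIRES, leaked_sig, KEY_V1, NOW))  # True

# Rotating the key invalidates the leaked URL, and every other URL with it.
print(is_valid(LEAKED_PATH, EXPIRES, leaked_sig, KEY_V2, NOW))  # False
print(is_valid(OTHER_PATH, EXPIRES, other_sig, KEY_V2, NOW))    # False
```

In this simplified model the only levers are waiting for expiry, rotating the key (which breaks every outstanding URL), or moving the files.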


In contrast, with an installed solution:


1. You can set real security, including all keys, based on your real and 
current security posture. 

2. You control network security, the annotations database, the raw data, 
everything. 

3. You control the entire keychain. 

4. You are aware of the other threats and you can take action such as 
pinning specific versions. 


5. You can choose an open source solution.


Annotator Access 


One of the first things that comes to mind is often annotators’ ability to
access samples. Consider a company that has a smart assistant device.
Perhaps a reviewer listens to audio data captured when the device misfired
and the microphone came on accidentally.


Or consider someone correcting a system that detects baby photos. There
are many levels of consent.


On the consumer side there are generally these big buckets:


1. Do not have consent to use directly (anonymized only)
2. Do have consent to use to train models - may be limited by time
3. Have consent, but data may contain Personally Identifiable Information
(PII) that may be sensitive if included in the model.


On the commercial side, or in more business-to-business type applications:


1. May include confidential customer data. This commercial data may
potentially be “more valuable” than any single consumer record.


There may be government regulation such as HIPAA or other compliance
requirements.


What a mess, right?


Other day-to-day considerations that may come up:


1. Can annotators download the data to their local machine?
2. Should an annotator be able to access records after completing
(submitting) them? Or are they locked out by default?


On the software side there are generally two major models that most
approaches fall into:


1. Task only availability. This means that as an annotator user, I can only
see my current assigned task (or set of tasks).
2. Project level. As an annotator I can see a set of tasks, or even multiple
sets of tasks.


As a project administrator, the two big decisions are basically around:


1. Structuring the data flow so that only data that is tagged as having 
consent, and/or meets other PII requirements, ever enters the annotation 
task flow at all. 


2. Deciding at what level annotators see tasks. 
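The two decisions above can be sketched in a few lines. This is an illustrative sketch under assumed names: the `Sample` and `Task` fields, the “no PII” rule, and the `"task"` / `"project"` level strings are all hypothetical, and your own consent and PII policy will be more nuanced.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    sample_id: str
    has_consent: bool
    contains_pii: bool


@dataclass
class Task:
    task_id: str
    project_id: str
    assignee: Optional[str]


def eligible_for_annotation(samples):
    """Decision 1: only consented samples without PII ever become tasks."""
    return [s for s in samples if s.has_consent and not s.contains_pii]


def visible_tasks(tasks, annotator, project_id, level):
    """Decision 2: 'task' shows assigned tasks only, 'project' shows them all."""
    in_project = [t for t in tasks if t.project_id == project_id]
    if level == "task":
        return [t for t in in_project if t.assignee == annotator]
    if level == "project":
        return in_project
    raise ValueError(f"unknown visibility level: {level}")


samples = [
    Sample("s1", has_consent=True, contains_pii=False),
    Sample("s2", has_consent=False, contains_pii=False),
    Sample("s3", has_consent=True, contains_pii=True),
]
tasks = [
    Task("t1", "project_a", "alice"),
    Task("t2", "project_a", "bob"),
    Task("t3", "project_b", "alice"),
]

print([s.sample_id for s in eligible_for_annotation(samples)])  # ['s1']
print([t.task_id for t in visible_tasks(tasks, "alice", "project_a", "task")])
# ['t1']
print([t.task_id for t in visible_tasks(tasks, "alice", "project_a", "project")])
# ['t1', 't2']
```

Filtering before task creation means an annotator can never be shown a sample that should not have entered the flow, regardless of which visibility level you choose.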


Data Science Access 


Naturally, data science must access this data at some point to do work on it.
Often data science gets a fairly free hand to “look at” the data. A stricter
system may log each data access, or record only the specification of the
query that was made, and the data may be sent directly to the training
apparatus, bypassing the data scientist’s local machine.


It’s worth considering that a single breach of a data scientist’s access is
often more severe, by orders of magnitude, than a breach of an annotator’s
access. An annotator, even if able to bypass various security mechanisms and
store all the data they see, may only see a small portion of the data of a
large project, whereas a data scientist may have hundreds of times more
access.


Root Level Access 


A super admin type user, IT administrator, etc. may have some level of
root system access. This user may be classified as a super admin in the
application, have direct database access, etc.


EXPLAINABILITY SIDEBAR 


Regina Barzilay, a professor at MIT’s CSAIL lab, says:


“It’s like a dog, which can smell much better than us, explaining how it can
smell something. We just don’t have that capacity. I think that as the
machines become much more advanced, this is the big question. What
explanation would convince you if you on your own cannot solve this
task?”²


The concept of explainability is important, but usually reserved more for 


the machine learning model analysis side. 


Open Source and Closed Source 


Open vs closed source is an argument as old as time. I’d like to take a
moment to highlight some specifics I have seen relative to this training data
area.


Open and closed source annotation takes on a special consideration for the 
rapidly changing training data landscape because the majority of this new 


generation of tooling is closed source. 


There have been many open source annotation tooling projects - some over
10 years old. However, in general most of those projects are no longer
maintained, or are very niche and not general purpose tools. Currently, the
two general purpose “second generation” annotation tools in open source
are Diffgram and LabelStudio. To be sure, there are many other tools - but
most are focused on very specific considerations or applications.


Open source software has many advantages - especially in this privacy 
focused area. You can see exactly what the source code is doing with your 


data and be sure there is no nefarious activity afoot. 


Open source does have some disadvantages. Most notably the initial setup 
of the system itself may be more difficult (not setup or ingestion of data, 


just the actual installation of the overall software). 


The commercial costs of open source and closed source may be similar: just
because the code is open does not mean the license is unlimited. Ease of
use is often similar in the context of commercially backed projects.


The costs of hosting open source are controlled by you. In general, the cost
of hosting is rolled into the cost paid to a commercial provider. This is a
nuanced tradeoff, but in practice the costs are often similar at small and
medium scales. At great volumes, usually the more control you have the
better it is for you.


Open source may tend to have greater compatibility, since often there is 
more use - from free users - who still run into issues and write tickets. This 


can mean less technical risk. 


Commercially backed open source projects usually require an upgrade to a
paid version at some point during commercial use. Sometimes there may be an
option to forgo paying, but at the very least that means less support.


Deployment 


Two of the most common deployment options can be generally bundled into 


client installed or software as a service. 


Client Installed Deployment vs Software as a Service 


While the local vs remote discussion is as old as time, I’ll call attention
to the specifics that matter for Training Data.


First, the volume of training data is often quite high: ten to a thousand
times more than in many other typical software use cases. Second, the data
is often so sensitive that “air gapped” - meaning the hardware is physically
isolated from the outside world - is not an uncommon phrase. Third, because
the completed training data is code, it’s best to think of protecting it as
similar to protecting regular software source code.


The net result of this is that there is a massive difference between a tool
you can install on your own hardware and the free tier of a service. For
example, unlike a product like Gmail that I can use about as much as I want
for free, most Training Data services have severe limits on the free tier. And
some services may even have privacy clauses that permit them to use your
data to build “mega” models that are to their benefit.


Given these variables, there is a clear edge to training data products that
can be installed on your hardware from day one, keeping in mind that “your
hardware” may mean your cluster in a popular cloud provider. There are
downsides to local deployment, but as packaging options improve (e.g.
Docker) it becomes easier and easier to get up and running on your own
hardware.


This is also another area where open source really shines. While many 
service focused providers often have more expensive versions that can be 
locally deployed, the ability to inspect the source code is often still limited 
(or even zero!). Further, these use cases are often rather rigid: it’s a preset 
setup and requirement set. Whereas software that’s designed to run 


anywhere can be more flexible to your specific deployment needs. 


Costs 


Besides commercial licensing costs, all client installed tools have hardware
costs to host and store data. Further, some tools charge for use of special
tools, compute use, and more.


Often licensing is the largest cost. A few ways to reduce costs include: 


1. Push automations onto the client as much as possible. This reduces the 
cost of server side GPUs. 


2. Separate true data science training costs from annotation automation. 


Common licensing models include by user, by cluster, or other more 


specific metrics. 


Annotation Interfaces 


Naturally there needs to be some kind of interface that a human uses to 
instruct and supervise the machine. Early open source projects almost 
exclusively focused on this. Point to raw data, add some kind of basic 
labels, and start labeling! The major categories of interfaces fall along the 
computer vision and natural language processing lines. Most tools that have 
an image interface will also have a video interface. Most text tools don’t 


have full fledged image or video interfaces. 


This can be a very opinionated area. In general, interfaces are trending 
towards having relatively similar feature sets here, such as similar hotkeys, 


for example. 


In part because the interface is often the easiest and most visible
difference between tools, there is a growing consensus on the major
sub-themes of what an interface must have. While differences definitely
exist, I suspect the core aspects of this area will become more similar over
time.


The one thing to be aware of with user interfaces is depth, especially of 
features relevant to your use case. For example, in some interfaces it’s easy 
to change the size of vertices, line width etc. If you have a medical use case 


that requires this, that may be a big deal. 


Use caution: certain aspects of the labeling interface often get
overblown. Some people get quite excited about what percent of the screen
the image takes up, or how many clicks it takes to do something. While
these things do matter, over-focusing on them misses the plethora of other
important concerns.


Some platforms allow customization of the UI itself. Later in this chapter
this is covered in more depth under Ergonomics of Labeling. If you have a
very unique data type, determining the necessary level of support will likely
need to be one of your first steps.


User Experiences 


The experience of using the application at its “boundaries” is often one of 
the toughest areas. Once you are in the walled garden of the system you can 
expect it to work. Often the best tools here have gone to great lengths to 
make the experience of getting data into the system and getting it out as 


easy as possible. 


Modeling Integration 


Think of this like watching a finished movie versus the editing suite used to
make it. In general, the Training Data system needs to support receiving
some kind of data from the modeling system, and supporting it in a training
data oriented way.


Most modeling systems have fairly limited support for rendering, let alone
editing, training data. This can often be confusing because these systems
can usually render fairly competent basic visuals.


Modeling integration is related to Streaming data, but they are different 


concepts. 


Multi-User vs Single User 


All modern systems are multi-user. In general if a system is single user it 
probably is not part of the modern training data paradigm and is likely 
focused on the pure user interface portion. The primary reasons for needing 


multi-user are expertise and volume. 


That being said, many systems can be operated successfully for a limited 
prototype by a single user, or may run on a single local machine for testing 
purposes. Much of this chapter is focused on multi-user, and team based 


systems. 


Integrations 


Parts of Training Data tooling are buried in the technical stack while others 


are surfaced to end users. 


The most basic concept is that you must be able to get raw data and 


predictions in and annotations out. Considerations include: 


• Hardware. Will it run in my environment? Will it work with my storage
provider?

• Software Infrastructure. Can I use the training systems, analytics,
databases, etc. I want with it?

• Applications and Services. How well does it integrate with my systems?
Backend and frontend?

• What types of custom integrations through APIs and SDKs are
available?

• How do I get the data back and forth to the training data system?


There is a lot of variation here. Many tools provide very little in the way of
integration. A lot force integrations to be API or technical team driven. In
general, the more modern systems offer UI based interactions with
integration processes, not just for setting up keys, but also for pulling and
pushing the data.


Ease of Use 


Ease of use comes in many forms. There are APIs and SDKs, installations,
super admin use, admin use, annotator use, etc. Depending on your use
case these all likely carry varying degrees of weight.


Annotator Ease of Use 


Often, one of the first thoughts is annotator ease of use. Perhaps you have
people who only use the software a few hours a month. Or others who
require training for anything that’s not bulletproof. Or perhaps you care
most about efficiency. If there’s one thing that’s clear about the annotation
UI it’s that everyone has different opinions. Ask 10 annotation project
managers what’s the best UI and you will get 20 opinions. A few trends do
seem to be becoming clearer.


As a small comparison, consider that for Git - a popular coding tool - there
are many different ways to view changes. Two popular ones are the split and
unified views, as shown in Figure 6-5. If you were used to always using one,
but a different annotation UI defaulted to the other view (or only offered
it), it may feel wrong. This is just one dimension; due to the wide feature
space annotation covers, there are often many different expectations for how
the UI should look.


Figure 6-5. Example of two different views of git data: split and unified.


Specification of Entire UI Appearance by Task Manager 


There is a growing trend for the annotation manager to specify the entire
appearance of the UI in the task. This means specifying both the label
schema and a schema for the user interface. This could include showing or
hiding buttons, ordering of buttons, sizing of UI elements, etc.
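As a rough sketch of what such a task definition might look like, consider the following. The field names (`label_schema`, `ui_schema`, `hidden_buttons`, and so on) and the button names are hypothetical and not taken from any particular tool. The idea is simply that the UI layout travels with the task alongside the label schema.

```python
# Hypothetical task definition: every field name here is illustrative only.
task_template = {
    "label_schema": {
        "labels": ["car", "pedestrian"],
        "instance_types": ["box", "polygon"],
    },
    "ui_schema": {
        "hidden_buttons": ["delete_file", "export"],
        "button_order": ["save", "next_task", "defer"],
        "label_tag_font_size": 14,
    },
}

# Default button set and order of the (hypothetical) annotation UI.
ALL_BUTTONS = ["save", "defer", "export", "next_task", "delete_file"]


def visible_buttons(ui_schema, all_buttons):
    """Apply the task's UI schema: hide buttons, then apply the custom order."""
    hidden = set(ui_schema.get("hidden_buttons", []))
    order = ui_schema.get("button_order", [])
    shown = [b for b in all_buttons if b not in hidden]
    # Buttons named in button_order come first; the rest keep default order.
    ordered = [b for b in order if b in shown]
    ordered += [b for b in shown if b not in order]
    return ordered


print(visible_buttons(task_template["ui_schema"], ALL_BUTTONS))
# ['save', 'next_task', 'defer']
```

With an empty UI schema the function falls back to the default buttons in their default order, so a task manager only has to specify what differs.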


Ergonomics of Labeling 


The first thing to level set on, in understanding why this is so hard, is
that there are many different flows for labeling. As a loose analogy,
consider the difference between a unified and a split view of a git diff.
Both show the same thing but in different ways.


Labeling tools are similar, in that depending on your use case, or especially
whatever tools you happen to be familiar with, you may have strong needs
and opinions on what’s good and bad.


Context Specific Tooling


A growing trend is for various UI elements, menus, etc. to be heavily based
on the context of the annotation. This started with the data type (image,
video, etc.), and has now shifted deeper into the instance type, and even the
context of the specific annotation. For example, a user who is on a video
annotation and right-clicks an annotation with a polygon may see a totally
different menu from right-clicking a newly created bounding box.


[IMAGE TO COME] [Insert visual showing the example of different UIs 
by type (3D, visual, text)] 


[IMAGE TO COME] [insert visual showing context menu change for auto
bordering ...]


A Growing Focus on Capability over Shortcuts 


Consider that Excel has over 200 popular shortcuts. My guess is most users
only know a small fraction of them, yet are able to use Excel perfectly fine
for their jobs. Some people get very concerned about shortcuts (hotkeys),
and while shortcuts are important, as the number of functions starts to
rapidly exceed a person’s ability to remember hotkeys for them, context and
UI design start to play a bigger role. At least for me it’s slower to remember
that Heading 1 in Google Docs is “Ctrl+Alt+1” than to just click Heading 1 in
the UI. I just don’t use it often enough to remember it.


There are so many things that can impact annotation efficiency. Some of 
this can be chosen by a labeling admin - perhaps for some cases an extra 
“confirm” prompt when completing a task may feel like a huge burden, 
while in others it’s a crucial step. As annotation continues to become more 
complex, and new users enter the picture, there is a shift away from 
shortcuts, and more to making sure the UI has the capabilities, and shows a 


reasonable context so the user can make use of those capabilities. 


Ease of use for different data types 


As a small story, in one day, one user explained to me how they needed the 
interface to be much simpler for global assessments of the file. Meaning 
that as a user I can make a multiple choice question come up the moment a 


file loads without any further user interaction. I agreed. 


On that same day, another user expressed frustration at not being able to 
zoom in to 3,000% (the limit was 1,000%), because they needed to identify 


a specific pixel on a 4K resolution image. I also agreed with that need. 


My point here is simply that what’s easy to use depends greatly on your 
perspective. If your annotators will balk at having to do two extra clicks for 
each file, or at having to squint because they can’t zoom in enough, both 


can be problems. 


Ease of use in Different flows 


The ease of updating existing data is often much different from creating all 


new annotations. 


Vastly Different Assumptions 


I tried one popular annotation UI where the delete key deleted the entire 
series across all video frames. This would be like painstakingly crafting an 
entire spreadsheet, only to bump the delete key and have it delete the entire 
sheet! Even though I was just testing, did I ever get a jolt when that 


happened! 


Of course, someone else could argue that it’s easier to use, since I need only 
select an object, and click delete, I don’t have to worry about the concept of 
a series, or that it appears in multiple frames. Again - what’s right here will 


depend on your use case. If you have complex per frame attributes, a single 


delete for what could be days of work is probably bad. Conversely if you 


have a simple instance type in some cases perhaps it’s desired. 


Again, customization by admins and users comes to the rescue. Do you 
want to see the previous annotation in the next frame of the video? Or you 


don’t? Choose what’s right for you, set it and forget it. 


Look at Settings not first Impression 


Even seemingly simple things, like the font size, position, and background
of label tags, are actually all very dependent on the use case. For some,
seeing any label visually may get in the way. For others, the entire meaning
is in the attributes and not showing them slows down progress significantly.


Same with polygon size, vertex size etc. For one user, they may be unhappy 
if the polygon points are hard to grab and move, another may want no 
points at all so a segmentation line on a medical image can be perfectly 


seen. 


If there’s one enduring theme, it is to look less at the appearance of the
UI at first glance, and more at what settings can be adjusted - or could be
added - to meet your needs.


Is it easy to use, or just lacking features? 


Another trade-off is that some vendors simply decline to enable features by
default, requiring each flow to be planned out. For example, this may mean
that instance types aren’t available in video, or settings may not exist, etc.
When evaluating, think hard about the ongoing usage and needs for more
complex scenarios and whether the tool will handle them.


Customization is the name of the game 


There is increasingly an expectation of customization at all levels of the 
software. From annotation settings, to admin configurations, to actually 
changing the software itself. Try to be aware of what’s “hard” and what’s 


“easy” for your given provider. 


For example, for a closed source provider, adding a new storage backend 
may be a low priority. With an open source project you may be able to 
contribute this yourself, or encourage others in the community to do so. 
Also you may be able to better scope out and understand the impact of 


changes and costs involved. 


On the enterprise side of it, try to understand what the core of the software 
really is for your use case. Is it a complete integrated platform? Is it the data 
storage and access layers? Is it the workflow or annotation UI? Because 
some of these tools vary so dramatically in scope and maturity it can be 


hard to compare. One tool may for example have a better spatial annotation 


UI, but be substantially lacking in the many other dimensions like the 


ability to update data, ingest, query data etc. 


As a small story, a user noticed that when a task was already completed, 
pushing a recently added ‘defer task’ button, led to a poorly defined state in 
the system. I agreed this was an issue. The fix was one line of code - a 


single if statement. 
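To give a feel for how small that kind of fix can be, here is a hypothetical reconstruction. This is not the actual code from that fix: the `Task` class and its status values are made up for illustration. The guard in `defer` is the single if statement.

```python
class Task:
    """Hypothetical task with a simple status field, for illustration only."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.status = "available"

    def complete(self):
        self.status = "complete"

    def defer(self):
        # The one-line fix: a task that is already complete cannot be deferred.
        if self.status == "complete":
            return False
        self.status = "deferred"
        return True


open_task = Task("t1")
done_task = Task("t2")
done_task.complete()

print(open_task.defer(), open_task.status)  # True deferred
print(done_task.defer(), done_task.status)  # False complete
```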


On the other hand, if a vendor doesn’t offer major features like data
querying, streaming, wizard based ingestion, etc., those may all be
multi-month projects, multi-year epics, or even never be added at all.
Because this is a new area, with vastly different assumptions and
expectations, I really encourage you to first consider the major features,
and then look at the speed of updates and execution on improvements. A vendor
that can adapt quickly is especially valuable in this new area.


Another user had experienced, in a different UI, that deleting individual
points was not “recoverable”, meaning that if, say, a hand was occluded on a
keypoint figure and I marked it that way, I couldn’t get the point back if I
later wanted to undo that. In Diffgram, the way the system was set up made it
easy to maintain this on a per point basis.


Installation and organization 


This covers common technologies used for installation and common 


technical trade offs. 


Docker 


A Docker image is a standard way to package software.


Docker Compose


A standard way to group multiple Docker images.


Kubernetes 


This is typically the go-to production recommendation, although there are
many other options. In theory, any time the Docker images are provided you
can manage those images as you see fit.


The major cloud providers have noticeably different implementations of
Kubernetes. Specifically, what may take hours of work on one platform is
sometimes much easier on others. This may be somewhat surprising given
that Kubernetes is open source, but some providers abstract more of it away
than others do.


Why am I talking about this? Well, as we discussed earlier, training data
often represents a volume of data unparalleled in other computing use cases.
Further, the expectations around data access, storage, and usage are new,
and often don’t align with many pre-optimized use cases.


Choosing a BLOB Storage


1. Where in the world is the system going to be deployed? If you have
annotators in another country, how does that impact your performance
and security goals?

2. If a cloud storage option is not available, what types of local options will
meet your needs?


Choosing a Database 


Diffgram uses PostgreSQL by default. Many other databases are available. 


Configuration Choices 


Where to store the data


Depending on your use cases you may see:


1. The original data 

2. A web optimized version(s). Like the way YouTube creates multiple 
resolutions of the same video. 

3. Sub segments of the data - such as frames. 


4. Duplication into datasets 


5. Duplication from erroneous imports, such as importing the same data
twice. Perhaps with intention to pre-label.


All of this is basically to say that some form of duplication is unavoidable.


Generally the choices here are:


1. Delete the raw data after the optimized versions are created.


Storing Individual Frames (Video Specific)


Do you need to access each individual frame on demand, or do you only 


need to know the frame number? 


Versioning Resolution 


How many versions - potentially all - of previous annotations are required? 


Should every change be recorded? 


In some systems, this may be critical, or simply a useful feature. As a rule
of thumb, turning on complete versioning will likely result in at least 80%
of the database being composed of soft deleted (superseded) annotation
versions.
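To see where a number like that can come from, here is a minimal sketch of complete versioning with soft deletes. The store class and its row layout are hypothetical, not any specific tool’s database schema. Every save keeps the previous version as a soft deleted row, so an annotation that is created and then edited four times leaves four soft deleted rows out of five.

```python
class VersionedAnnotationStore:
    """Hypothetical store where every change keeps the previous version."""

    def __init__(self):
        self.rows = []

    def save(self, annotation_id, data):
        # Soft delete the current version instead of overwriting it.
        for row in self.rows:
            if row["annotation_id"] == annotation_id and not row["soft_deleted"]:
                row["soft_deleted"] = True
        self.rows.append(
            {"annotation_id": annotation_id, "data": data, "soft_deleted": False}
        )

    def current(self, annotation_id):
        """Return the live (not soft deleted) version, or None."""
        for row in self.rows:
            if row["annotation_id"] == annotation_id and not row["soft_deleted"]:
                return row["data"]
        return None

    def soft_deleted_fraction(self):
        if not self.rows:
            return 0.0
        deleted = sum(1 for row in self.rows if row["soft_deleted"])
        return deleted / len(self.rows)


store = VersionedAnnotationStore()
# One bounding box: created once, then adjusted four times.
for width in [10, 12, 11, 13, 14]:
    store.save("box_1", {"width": width})

print(len(store.rows))                # 5
print(store.current("box_1"))         # {'width': 14}
print(store.soft_deleted_fraction())  # 0.8
```

The actual ratio in your system depends on how often annotations are edited; the sketch only shows why the historical rows can quickly outnumber the live ones.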


Retention Period 


How long do you need to store the data? Can some be automatically 


archived after a period of time? 
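A retention rule can be as simple as comparing a creation time against a cutoff. Here is a small sketch; the file names, dates, and the 180 day period are made-up values, and a real system would read creation times from its database or storage provider.

```python
from datetime import datetime, timedelta


def files_past_retention(created_at_by_file, retention_days, now):
    """Return the names of files created before the retention cutoff."""
    cutoff = now - timedelta(days=retention_days)
    return sorted(
        name for name, created in created_at_by_file.items() if created < cutoff
    )


# Hypothetical files and creation times.
files = {
    "scan_001.png": datetime(2022, 1, 10),
    "scan_002.png": datetime(2022, 11, 5),
}
today = datetime(2023, 1, 1)

print(files_past_retention(files, retention_days=180, now=today))
# ['scan_001.png']
```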


Bias in training data 


There is bias at many levels:


1. Human bias 


2. Technical bias 


The technical concept of Bias 


Bias is the fixed value that is added to the more variable part of the
calculation. For example, if you wanted the model to return 3 when the
weighted sum is 0, you can add a bias of 3.³ While this may be useful to
researchers, the bias that we are concerned about is different.
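In code, this technical kind of bias is just a constant added to a weighted sum. Here is a minimal sketch of a single neuron without an activation function; the weights and inputs are arbitrary values I chose so that the weighted sum comes out to 0.

```python
def neuron_output(weights, inputs, bias):
    """Weighted sum of the inputs plus a fixed bias term."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return weighted_sum + bias


# The weighted sum is 0.5 * 2.0 + (-0.5) * 2.0 = 0.0, so the bias decides
# the output on its own.
print(neuron_output([0.5, -0.5], [2.0, 2.0], bias=3.0))  # 3.0
```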


This isn’t your grandfather’s Bias 


In classic ML we so often encounter the phrase “imbalanced dataset”. From
the perspective of training data, this is not as straightforward as it may
appear. Consider, for example, that we are designing a threat detection
system for an airport 3D scanner.


[IMAGE TO COME] [insert visual of 3d millimeter wave scanner]


We may have classes like “forearm” and “threat”. How many forearm
examples do we need? How many threat examples do we need? Well, the
forearm, in this context, has very low variability, meaning that we likely
only need a small sample set to build a great model. However, the threat
placement, and purposeful efforts to obscure it, mean we likely need many
more examples. At first glance, then, “forearm” may appear imbalanced
relative to “threat”, but that imbalance is actually desirable. Another way
to approach this is to subdivide threat into smaller categories, but that
misses the point that “threat” as a concept is a harder problem than
“forearm”. Maybe it needs 10x as much data, or maybe 100x. It doesn’t matter
how much more data it needs; it matters that the model is performant on that
class.
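One way to act on this is to look at performance per class rather than at example counts. Here is a small sketch that computes per-class recall; the label lists are made-up numbers for illustration, not results from a real scanner model.

```python
from collections import Counter


def per_class_recall(true_labels, predicted_labels):
    """For each class, the fraction of its true examples predicted correctly."""
    totals = Counter(true_labels)
    correct = Counter(
        t for t, p in zip(true_labels, predicted_labels) if t == p
    )
    return {label: correct[label] / totals[label] for label in totals}


# Hypothetical evaluation set: few forearm examples, more threat examples.
true_labels = ["forearm"] * 4 + ["threat"] * 10
predicted_labels = ["forearm"] * 4 + ["threat"] * 7 + ["forearm"] * 3

print(per_class_recall(true_labels, predicted_labels))
# {'forearm': 1.0, 'threat': 0.7}
```

In this made-up case the class with fewer examples is the one that performs well, so the question to ask is which class needs more or better data, not which class has fewer rows.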


This leads us to a slightly more subtle problem. We have been assuming for 
example that all examples of a “threat” are equal. But what if the training 


data isn’t representative of the real life data? 


The thing is that correcting “obvious” bias has a variety of technical
solutions (such as hard negative mining), but correcting the relationship of
the data to the real world does not. This is similar to how, for example, a
classic program can pass hundreds of “unit tests” but still utterly fail to
meet the end user’s needs.


Desirable Bias 


The model needs to be “biased” towards detecting what you want it to 
detect. So keep in mind that from that perspective you are trying to make 


the model “biased” towards understanding your world view of the data. 


Bias is hard to escape


Imagine this scenario:


1. A dataset is created in month 1
2. To maintain freshness, only “new” data from the last 6 months is used
3. To optimize the model, a sampling of the output and errors is reviewed
and corrected by humans


What this really means though, is that every “new” example is recycled into 
the model. In other words, a prediction and subsequent correction happens 
on day 1. How long can we use this? We presume this “fresh” correction is 


valid for 6 months. But is it really? 




Well, the reality is that even if it’s “correct”, its basis was a model that is
now old. This means that even if we retrained the model using only data
corrected within the last 6 months, there is still bias from the “old” model
creeping in.


This is an incredibly hard problem to model. I’m not aware of a scientific
solution here. This is more of a process matter: be aware that decisions
made today can be difficult to completely roll back tomorrow.


An analogy in coding may be system architecture. A function may be
relatively easy to “correct”, but it’s harder to realize whether that function
should exist at all. As a practical matter, an engineer working to correct an
existing function will likely start with the existing function, so even the
corrected function will contain the “spirit” of the old function, even if
literally every character is changed.


Besides the literal data and instances, a further example here is the Label
Templates. If the assumption is to always use the existing predictions, it
may be hard to realize whether the Templates are actually relevant any more.


Metadata 


Imagine you spent thousands of hours (and potentially hundreds of
thousands of dollars) creating multiple datasets, only to realize it wasn’t
clear what assumptions were present when they were created. There are many
reasons, some described below, that a dataset that is technically complete
can become largely unusable.


A surprisingly common problem is losing information on how a set was 
constructed. Imagine for example that a data supervisor has a question 
about a project and sends it over a channel like email or chat. The problem, 
such as “how do we handle a case when such and such happens”, is 


resolved and life goes on. However, months later... 


Metadata of a Dataset (Definition)


Data about the dataset that is not directly used by the model. For
example, when the set was created, who created it.⁴


Lost Metadata 


Examples of metadata that is commonly “lost” from sets: 


• What was the original purpose of the set?

• Was the element originally created by a machine or by a human? What Assist
methods, if any, were used?

• When was the data captured? When was it supervised?

• Sensor type and other data specifications

• Was the element reviewed by multiple humans?

• Who were the humans who created it?

• What was the context in which the humans created it?

• What other options (e.g. per Attribute group) were presented?

• Was the templating schema changed during set construction? If so,
when?

• How representative is the set vs the “original” data?

• When was the set constructed? For example, the raw data may have a
timestamp, but that is likely to be different from when a human looked at
it (and this is likely to be different per sample).

• What guides were shown to supervisors? Were those guides modified, and if
so, when?

• What labels are in this set and what is the distribution of the labels?

• What is the schema relating the raw data to the annotations? For
example, sometimes annotations are stored at rest with a filename like
“00001.png”. But that assumes that it’s in the folder “xyz”. So if for
some reason that changes, or isn’t recorded somewhere, it can be unclear
which annotations belong to which.

• Is this only “completed” data? Does the absence of an annotation here
mean the concept doesn’t exist?


This can be avoided by capturing as much information as reasonably 
possible during the creation process. In general, using a professional 


annotation software will help with this. 
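To make the list above more concrete, here is a minimal sketch of recording some of that metadata next to the annotations themselves. Everything in it is an assumption for illustration: the class names, the field names, and the `resolve_path` helper are hypothetical and are not taken from any particular annotation tool.

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone


@dataclass
class DatasetMetadata:
    """Hypothetical set-level record of how a dataset was constructed."""
    purpose: str                  # the original purpose of the set
    raw_data_root: str            # folder that stored filenames are relative to
    schema_version: str           # which version of the label templates was used
    guide_versions: list = field(default_factory=list)  # guides shown to supervisors
    is_complete: bool = False     # does a missing annotation mean "not present"?
    constructed_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


@dataclass
class AnnotationRecord:
    """Hypothetical per-element record."""
    filename: str                 # e.g. "00001.png", relative to raw_data_root
    label: str
    created_by: str               # "human" or "machine"
    annotator_id: str
    review_count: int             # how many humans reviewed this element
    supervised_at: str            # when a human looked at it (not capture time)


def resolve_path(meta, record):
    """Join the recorded root folder and the stored filename."""
    return f"{meta.raw_data_root.rstrip('/')}/{record.filename}"


meta = DatasetMetadata(purpose="demo set", raw_data_root="xyz/", schema_version="v1")
record = AnnotationRecord(
    filename="00001.png",
    label="car",
    created_by="human",
    annotator_id="annotator-1",
    review_count=2,
    supervised_at="2022-01-01T00:00:00+00:00",
)

# Store the set-level metadata as JSON alongside the annotations.
serialized = json.dumps(asdict(meta))
```

The design choice worth noting is that the root folder is stored with the set, so a filename like “00001.png” can still be matched to its raw data if the files are later moved.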


https://linuxfoundation.org/wp-content/uploads/LFResearch_SBOM_Report_020422.pdf, page 6


https://www.technologyreview.com/2020/09/23/1008757/interview-winner-million-dollar-ai-prize-cancer-healthcare-regulation/


Example inspired by https://medium.com/fintechexplained/neural-networks-bias-and-weights-10b53e6285da


Side note: Some people refer to annotations as “metadata”. I don’t because 
annotations are a primary component of the overall data structure. If you 


wish to call annotations metadata, then everything here is meta-metadata. 


Chapter 7. AI Transformation




AI Transformation Introduction 


Beyond Digital Transformation, this is the start of the AI transformation
era.


The companies that are the most successful here will not be the ones that
have a single crack “AI team” of mystical sorcerers. They will be the
companies that embed the concept of AI most broadly, and push
responsibility for training the AI systems down to the lowest capable
levels of the firm.


AI Transformation is the answer for business leaders who want to transform 
their company to become AI first. It can start today with you. So far we 
have covered the basics of training data and taken deep dives into specific 
areas like automation. Now we zoom out to see the forest. How do you 


actually get started with training data at your company? 


In this chapter I will share five key steps, starting with mindset and
leadership, then moving on to the concrete problem definition, and
concluding with the two key steps to solving it: annotation talent and
training data tools.


Please consider the plan merely a starting point to be adapted to your needs. 


To get started quickly with adopting modern AI in your company, here are
five key things to act on:


1. Create new meaning instead of analyzing history - The Creative 
Revolution 

2. Appoint someone to lead the charge 

3. Target use cases involving experts, existing work, and work of a high 
volume 

4. Rethink AI Annotation Talent - Quality over Quantity 


5. Adopt modern Training Data tools 


Figure 7-1. AI Transformation Map


The Creative Revolution is a mindset of using human guidance to create 
new data points to solve problems. Opening up to the magnitude of this 


potential will help prime you and your team to define the best use cases. 


Next is appointing someone to lead the charge - a Director of Training Data.
This position will be new in most companies. Given that the bulk of these AI
costs is human labor in training data, someone naturally needs to be
accountable for it.


Then comes Use Case Discovery. I provide a canvas upon which to sketch your
ideas, example questions to ask, and common missteps to avoid. Moving to
implementation, it’s important to set a Talent Vision. Rethinking Annotation
Talent will empower you to achieve your use cases, and to do so in a
cost-effective manner.


When you have the mindset, the leader, the use case, and the talent vision, 
you can use modern Training Data tools to make it a reality. Training Data 
tools have advanced considerably in the past few years and there’s lots to 


learn! 


Getting Started 


Seeing your Day to Day Work As Annotation 


For the last decade or so the big wins with AI have been with classical cases
as discussed in earlier chapters. Now the biggest commercial opportunities
are with supervised learning on unstructured data, which requires Annotation.
Early work was often a case of putting together an “Annotation project”.
This meant throwing data over a fence to some team and hoping for the
best.


This is like ordering fast food. Sure it will help with immediate hunger, but 


it’s not a healthy long term solution. True AI transformation, like eating 


healthy, takes work. It’s a mindset shift, from seeing Annotation as a one off 


project, to seeing your day to day work as annotation. 


To frame this, consider this statement with regards to your company: 


All of your company’s day to day work can be thought of as 


annotation. 


That’s right. Every action that the majority of your employees take, every
day, is both literally and figuratively annotation. The real question here is:
how can we shift those actions from being lost, every day, to being captured
in a way that can be repeated, to the same degree of quality, automatically?


[Figure: a worker’s thought changes from “I have 5 documents to review” to “I have 5 documents to annotate”.]


Figure 7-2. Thought process shift from review to annotate


To invert this, every moment an employee does something that’s not being 
captured as an annotation, that’s productive work lost. The greater the 
percentage of that work that can be captured through annotation the greater 


your productivity is. 


Before                                  After

All day to day work is “one offs”       Day to day work is handling the exceptions not yet captured in annotations

Training is for humans only             Training is for machines too

If it’s not in the computer it          If it’s not in training data it doesn’t exist
doesn’t exist


A rough analogy here is the movement to Digital: if something didn’t exist
in a Digital form, then it didn’t exist (whether that was true or not). Now,
in the same way, if it doesn’t exist in Training Data, it might as well not
exist either, since it won’t help your company improve its productivity.


There are two major types of AI Transformation: 


1. At a classic company, inspiring all relevant aspects of operations to give
consideration to AI and establishing new reporting units.

2. At an AI product company, inspiring a Training Data first mindset and
reorganizing reporting relations.


The Creative Revolution of Data Centric AI


Data Centric AI can be thought of as focusing on training data in addition
to, or as even more important than, data science modeling. But this
definition does not really do it justice. Instead, consider that Data Centric
AI is more about creating new data to solve problems.
The critical realization: you can create new data


In the data centric mindset you can:


1. Use or add data collection points: new sensors, new cameras, new ways to
capture documents, etc.

2. Add new human knowledge.


For example, for a self driving case, if you want to detect people getting cut 


off, you can create the meaning of “Getting cut off” as shown in Fig 7-3. 


[Figure: a three-step process with an example. Step 1, Data Collection (use existing or add new): access or add a camera. Step 2: tell it what “Getting Cut Off” means. Step 3, Detection: problem solved automatically.]


Figure 7-3. First example of creating new data for data centric AI


Or if you want to automatically detect what a “Foul” means you can create 


that too. 
[Figure: the same process applied to sports. Collect: existing TV data. Tell it what a “Foul” means. Result: automatically detects “Player getting fouled”.]


Figure 7-4. Second example of creating new data


You can change what data you collect 


This may be obvious. Let’s consider how different this is. Classically in 
data science you could not change the data you collected. For example if 
you were collecting sales data, the sales history was just that - history. 
Small specifics aside, the sales were whatever they were. You can’t invent 


new Sales, or really change the data. 


With Training Data, you can literally collect new data. You can place new 
cameras. Launch more satellites. Install more medical imaging devices. 
Add more microphones. Change the frequency of collection. Increase the 


quality, e.g. the resolution of the camera. 


By focusing on the data that you can control, you can directly improve the 
performance. Better camera angle? Better AI. More cameras? Better AI. 


More...? I think you get the picture. 


You can change the meaning of the data 


Back to our sales example. A sale is a sale. There’s little value in trying to 


expand a row in a spreadsheet to mean something more than it is. 


With Training Data, you literally create meaning that was in no way, shape,
or form there before. You look at a piece of media, like an image, for which
the computer had no meaningful structure before, and literally say “this is a
human”, “this is a diet soda”, “this is a lane line”. This act of annotation
maps your knowledge into the computer.


The limit is only your creativity. You can say “this human is sad”, or “this
diet soda has a dent in it”. You craft and mold and modify it to your needs.
These infinite degrees of freedom are what make it so powerful.


You can create! 


So next time someone says to you that “data centric AI is a way to get 
better model performance” you will know it’s so much more than that! It 
means you can change the data you collect, and the meaning of that data. It 
means you can encode your understanding of a problem in a whole new 
way. It means you can define the solution even if no solution existed prior. 


You can create! 


Think Step Function Improvement 


Consider retail shopping. We can supervise machines to tell them what 


people look like and what groceries look like. 


This unlocks all new use cases like entirely replacing the cashier. This is not 
5% better pricing. This is a fundamental shift in how we shop for groceries 


and design stores. 


Concept: Shopping
Before: Every time I shop, every time a cashier checks me out, that work is lost.
After: Annotation of people shopping.

Concept: Driving
Before: Every time I drive, the effort put in, the work, is “lost”.
After: Annotation of driving. Professional annotators annotate common scenes. My driving is captured to aid in this effort.

Concept: Document Review (imagine loans, requests, etc.)
Before: Every document I review, the work is lost.
After: Annotation of document. Work is captured to reduce similar work in the future.


The key insight which bears repeating: Anything you can annotate can be 


repeated. 


Naturally there are limits to this. Some of the efforts described above 
require many people and years to implement. But conceptually the idea is 


there. 


Appoint a Leader: a Director of Training Data 


All revolutions need leaders. Someone to preach the new message. To rally
the troops. To reassure doubts. And the leader must have a team. In this
section, I’ll lay out best practices and common job roles, and discuss how
they all come together to form an optimal team structure to support your
Training Data revolution.


Team organizational concepts are key to training data success. From the
company’s viewpoint, what is changing? Are the differences between
training data and data science reflected in the organization? What new
organizational structures are needed? Even if you are already in an
AI-centric organization, there are training data specific nuances that can
help accelerate your progress.


Go From a Work Pool to Standard Expectation for All 


Right now, the de facto standard process of getting a bunch of people
together to annotate is akin to an old typing pool: an army of people doing
relatively similar work in order to translate from one medium to another.
This is clearly inefficient.


Instead, what if every new or updated process gave first thought to AI
Transformation? What if every new application thought first about how the
work could be captured as annotation? What if every line of business leader
thought first about how annotation factored into their work?


Before                                        After

Hire a new, separate pool of workers          Your existing experts and data entry
(usually outsourced).                         folks (primarily)

One off projects, separate one time           Part of daily work, like using email
efforts                                       or word processing

“Bolt on” mindset                             “AI First”: assumes AI will be present
                                              or demands it. Integrated systems,
                                              aiding in the flow of existing work

AI being “Pushed” onto people                 People “Pulling” AI into the org


Yes, people who are training AI instead of just doing their normal job may
demand a higher wage. However, the return on capital is still a great deal
better when paying a % higher wage for one person who, when combined with
AI, is as productive as 2-3 people.
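As a rough worked example of that trade-off, the sketch below compares the wage cost per “one person’s worth” of output. All of the numbers are made up for illustration: the 50,000 base wage, the 20% premium, and the 2x productivity multiple are assumptions of mine, and the function name is hypothetical.

```python
def cost_per_unit_of_output(base_wage, wage_premium_pct, productivity_multiple):
    """Wage cost divided by how many people's worth of work gets done."""
    if productivity_multiple <= 0:
        raise ValueError("productivity_multiple must be positive")
    wage = base_wage * (100 + wage_premium_pct) / 100
    return wage / productivity_multiple


# One person doing their normal job, no premium, no AI.
baseline = cost_per_unit_of_output(50_000, 0, 1)

# The same person paid 20% more, who combined with AI does the work of 2 people.
with_ai = cost_per_unit_of_output(50_000, 20, 2)

savings_pct = 100 * (baseline - with_ai) / baseline
```

Under these assumed numbers the cost per unit of output falls even though the wage rises. This ignores tooling and hardware costs, which would need to be added for a real comparison.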


During the transition, it’s natural that you may still need additional help.
Depending on your business needs, there may be valid use cases that
require outsourcing. As with any labor, there is a need for a spectrum of
annotation labor. But the key difference here is that the annotation is seen
as normal work, and not a separate project for “those people” to do. The
existence of a range of talent is a different concept from where you get that
talent. A pool of workers exclusively hired “to annotate”, without any other
context to your business, is distinctly different from any level of worker
hired into your business directly.


To put this another way, imagine a company of 250 people. Hiring 50
people overnight would be a massive deal. Yet that same company may
think it’s ok to hire 50 annotators. Try to see it more as truly hiring 50
people into your company.


You may already be thinking that there are a few areas that would be
good targets, and/or “Well this sounds great, but I just can’t see a way to
annotate the such and such process”. I bring up the typing pool
concept because, while AI is a bolt-on, after-the-fact process, there will
always be such hurdles. The more the organization is pulling AI in, seeing
annotation as a new part of its day to day work and being directly involved
in annotation, the more opportunities will come.


Sometimes Proposals and Corrections, Sometimes Replacement


A simple example of integrated proposal and correction that you may have
already used is email. For example, Gmail will prompt you with a suggested
phrase as you’re typing. That phrase can be accepted or rejected.
Additionally, the suggestion can be marked as “bad” to help correct
future recommendations, proposals, predictions, etc.


Figure 7-5. Example of AI proposal to user


IMAGE TO COME


Figure 7-6. Example of integrated training data collection


This highlights an important consideration for all of the products you buy as 
you go forward. It also circles back to the theme of making someone more 


productive rather than directly replacing them. 
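To illustrate how a proposal and correction loop like the email example can double as training data collection, here is a minimal sketch. It does not describe how any real email product works. The `ProposalEvent` structure, the `Feedback` values, and the rule for turning an interaction into a training example are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Feedback(Enum):
    ACCEPTED = "accepted"
    REJECTED = "rejected"
    MARKED_BAD = "marked_bad"


@dataclass
class ProposalEvent:
    context: str       # what the user had typed so far
    proposal: str      # what the system suggested
    feedback: Feedback
    user_text: str     # what the user typed instead, if anything


def to_training_example(event):
    """Return (context, target, is_positive), or None if nothing was learned."""
    if event.feedback is Feedback.ACCEPTED:
        # The user confirmed the proposal, so it becomes a positive example.
        return (event.context, event.proposal, True)
    if event.feedback is Feedback.MARKED_BAD:
        # An explicit "bad" mark becomes a negative example.
        return (event.context, event.proposal, False)
    if event.user_text:
        # A plain rejection: treat the user's own text as the correction.
        return (event.context, event.user_text, True)
    return None


accepted = ProposalEvent("Thanks for", " your help", Feedback.ACCEPTED, "")
corrected = ProposalEvent("Thanks for", " your help", Feedback.REJECTED, " the update")
examples = [to_training_example(e) for e in (accepted, corrected)]
```

The point of the sketch is that the employee is simply doing their normal work, while each accept, reject, or correction is captured in a form that can be used later.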


Upstream Producers and Downstream Consumers 


Training Data work is upstream of Data Science. Failures in the training
data flow down to data science (as shown in Fig 7-7). Therefore, it’s
important to get Training Data right.


Why am I making a distinction between Training Data and Data Science?
Because there are clear differences between the day to day responsibilities
of the people producing Training Data and the Data Science people
consuming it.


Figure 7-7. Relationship between production of Training Data and downstream use by Data Science 


I think of it as a Producer and Consumer relationship. 


Producer and Consumer Comparison 


Training Data - Producer                      Data Science - Consumer

Capturing business understanding and needs    Creates models that map fresh data
in a form usable by data science.             back to business needs

Converting Unstructured Data to
Structured Data

Responsible for Annotation Workflow           Uses Annotation Output

Manages Dataset Creation, Curation,           Uses Datasets, mild curation
Maintenance

Supervising Output from Data Science          Generates Prediction Output

Example KPIs*:                                Example KPIs:
% of Business Need Covered by Data            Model Performance, e.g. Recall or Accuracy
% of Annotation Rework Required               Inference Runtime efficiency
Volume, Variety, and Velocity of Annotation   GPU/Hardware resources efficiency
Depth of Annotation


* Key Performance Indicator
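As a small illustration of how two of the producer KPIs might be computed, consider the sketch below. The shape of an annotation record (a dict with `label` and `reworked` keys) is hypothetical. The reading of “% of Business Need Covered by Data” as the share of required labels that have at least one annotation is also my simplifying assumption, and your definition may well differ.

```python
def percent(part, whole):
    """Percentage of part in whole, returning 0.0 when whole is zero."""
    return 100.0 * part / whole if whole else 0.0


def producer_kpis(annotations, required_labels):
    """Compute two example producer KPIs from a list of annotation dicts."""
    labels_seen = {a["label"] for a in annotations}
    covered = labels_seen & set(required_labels)
    reworked = sum(1 for a in annotations if a["reworked"])
    return {
        "pct_business_need_covered": percent(len(covered), len(set(required_labels))),
        "pct_annotation_rework": percent(reworked, len(annotations)),
    }


annotations = [
    {"label": "invoice", "reworked": False},
    {"label": "invoice", "reworked": True},
    {"label": "receipt", "reworked": False},
    {"label": "receipt", "reworked": False},
]
kpis = producer_kpis(annotations, ["invoice", "receipt", "contract", "loan"])
```

Here two of the four required labels have data and one of the four annotations needed rework.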


Producer and Consumer Mindset 


As a data scientist, the thoughts are often along the lines of “What datasets 
do we already have for x?” or “If I just had x dataset then we could do y”. 
There is almost a short circuit so to speak where a project starts, and the 
moment there is an idea for something the question is “how quickly can we 


get a dataset for this?”. 


An analogy: I’m hungry and I want to eat something. I want to eat it now.
I don’t want to worry about the farm crop or the harvester or anything like
that. There’s nothing wrong with that - we all need to eat - but we must
realize the distinction, and that the farmer (the producer of the training
data) is equally important to this.


Now a Farmer suffers from her own delusions here. The more one learns 
about training data, and the more emphasis is placed on production, the 


further one gets from concerns about how to actually use the data. 


As an illustration of this, I had a conversation with a leading training data
production director who was trying to figure out how to get a specific type
of rotated box. I suggested annotating it as a 4 point polygon, since the box
can be derived from the bounds of the polygon. This was a surprise to
him - he had thought of box and polygon as two totally distinct forms of
annotation. The point here is that the deeper you get into the training data
world, the less the actual end usage of the data is remembered (or known),
and the more the top level types and human interactions with the data
take the primary focus.
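The polygon to box idea can be sketched in a few lines. This is a minimal illustration, not the method of any particular tool. It assumes the 4 points are the corners of a rectangle listed in order around the shape, and the function names and the returned dictionary keys are hypothetical.

```python
import math


def axis_aligned_bounds(points):
    """Smallest upright box (x_min, y_min, x_max, y_max) containing the points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


def rotated_box(points):
    """Center, size and angle of a rectangle given as 4 ordered corner points."""
    if len(points) != 4:
        raise ValueError("expected exactly 4 points")
    (x0, y0), (x1, y1), (x2, y2), _ = points
    return {
        "cx": sum(x for x, _ in points) / 4,
        "cy": sum(y for _, y in points) / 4,
        "width": math.hypot(x1 - x0, y1 - y0),    # length of the first edge
        "height": math.hypot(x2 - x1, y2 - y1),   # length of the second edge
        # Angle of the first edge relative to the x axis, in degrees.
        "angle_deg": math.degrees(math.atan2(y1 - y0, x1 - x0)),
    }


# A square rotated by 45 degrees, annotated as a 4 point polygon.
polygon = [(0, 0), (1, 1), (0, 2), (-1, 1)]
upright = axis_aligned_bounds(polygon)
box = rotated_box(polygon)
```

The same annotation therefore yields either an upright box or a rotated box, depending on what the consumer of the data needs.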


Why is New Structure needed? 


First, bad data will mean bad AI. Bad AI means wasted investment in AI. 
This upstream role is so central to the success of AI projects that it must be 


given an appropriate role. 


Second, as part of the goal of AI transformation there must be a principally
responsible individual to lead the charge. While a VP or CEO can also play
this role at the strategic level, the Director is responsible for executing
this strategy.


Third, as the volume of people involved balloons, the simple reality is that
this is a team of teams, with people of many distinct characteristics.
Even the most minimal team will likely have at least one or two production
managers, and twenty to fifty Annotation Producers.


This can easily grow to hundreds of people. In a very large organization
there may be hundreds or even thousands of part-time annotation
producers.2


It’s an army of people to be managed. 
To recap: 


1. Bad data = bad AI 
2. AI Transformation Leader needed 


3. Army of Annotators 


The Inverted Budget 


One of the most perplexing things about organizing data science and training
data is budgeting. Often only a very small team of data science
professionals is needed relative to a much larger training data team. From
a cost perspective, the training data cost may be an order of magnitude
greater than the data science cost. Yet somehow the data science line item
is often the top level item.
An improved setup is: 


1. AI/ML
   1. Training Data
   2. Data Science
   3. Data Engineering
   4. Etc.


Historically, one reason for this has been the large hardware cost that
data science is responsible for. It’s worth noting that, directionally, this
AI training and running cost is expected to decrease with time.
Further, from a divide and conquer perspective on resources, since Data
Science is already burdened with managing this hardware cost, it makes
little sense to further cloud their minds with training data concerns.


The Director’s Background 


A few skill sets that are essential to consider:


1. This is a people leadership role 

2. This is a change agent role 

3. The person must be in tune with the business needs 

4. Ideally the person is able to traverse multiple departments of the 
company, perhaps is already a corporate level analyst 

5. There must be some level of technical comprehension to facilitate 


discussion with engineering 
What the background need not be:


1. Formal education requirements. This is more a school of hard knocks
role. In practice, though, this person may hold an MBA, an undergraduate or
graduate scientific degree, etc. Most likely they will also be up to date on
the latest refresher and online courses in machine learning areas.

2. A “Data Scientist”. In fact, the more data science background the person
has, the greater the risk that they focus on the algorithmic side over this
new creative, human-driven side.


The Director’s Budget 


1. People 
2. Tools 


Director of Training Data 


First, it is most ideal to have a Director of Training Data position 


established. 


This person can, for example, report to the VP of AI, VP of Engineering, or
CTO. Even if this role is baked into some type of Director of AI role, the
level of responsibility stands.


Figure 7-8 illustrates both the Director of Training Data’s responsibilities
and sample descriptions of the key team member roles. Please note these
are not meant to be complete job descriptions; they just highlight some of
the key structural elements of each role.


[Figure: org chart. Boxes recoverable from the figure include Line of Business Managers, Expert Producers, Dedicated Producers, and End User Producers.]


Figure 7-8. New Org Chart Example


Reading this Chart 


This chart talks about “teams” or plural “engineers”. Of course your org will
not match this exactly. Think of it as a starting point. Each box can be
thought of as a role. One person may play all or most of the roles.


AI focused Company modifications


1. May have fewer or zero evangelists

2. Line of business manager may be the overall Product Manager 

3. The Producers will still vary depending on the company. For example 
the Experts may be the end users, or they may be part time and external, 


but still named partners not a generic “pool”. 


Classic Company modifications 


1. May have fewer dedicated producers and more evangelists


Spectrum of Training Data Team Engagement 


1. Advisory and training

2. Maintaining tool sets used by producers in other teams
   1. Data ingest/access
   2. Support for annotation production

3. Actively managing production of annotation data


The idea of the primary organization pulling AI into itself is not mutually
exclusive with the idea of there being a team or department for training
data. In an AI mature organization the team may primarily act as an
advisor, staying on top of the latest trends and maintaining the overall
tools.


What’s right will depend on your specific org. My main intent here is to
convey the general spectrum, and that a separate team is needed even if
they aren’t the ones doing the literal production of the annotation data.


Dedicated Producers and Other Teams 


Dedicated producers are direct reports of the Production Manager. This is 
for cases where the volume of work is such that the person’s full time job is 
annotation and they are not attached to any specific business unit. Again 


long term this may be rare, but it’s a reality for teams getting started, 


transitioning to this, and for various projects where no other production 


capabilities are available. 


For the sake of simplicity in the diagram, outsourced teams can be thought 


of as dedicated producers. 


Organizing Producers from Other Teams 


Producers in other business units run the spectrum from entry level to 


expert. 


End users, who may be a different group, may also produce their own
annotations. Usually end user annotations are more “by accident”, as part of
using the application or providing some form of minimal feedback.


Here I will cover:


1. Director of Training Data 

2. Training Data Evangelist 

3. Training Data Production Manager(s) 
4. Annotation Producer 


5. Data Engineering 
Let’s dive in! 


Director of Training Data Responsibilities 


Chiefly this person is responsible for overall Production of Training 


Data. This includes: 


1. Turning line of business needs into successfully produced training data. 

2. Generating work for training data production by mapping business needs 
to Training Data concepts. 

3. Managing a team of Production Managers who facilitate the day to day 
Annotation Production. 

4. Managing Evangelists who work with Line of Business managers to 
identify Training Data and AI Opportunities. Especially feasibility 
concerns regarding annotation. Of the various ideas proposed by say a 
line manager, only a handful may actually be cost-effective at that 
moment in time to annotate. 


5. Managing the Training Data Platform. 


Besides general efficiency and visibility into annotation work, this person 
must map the productivity in annotation back to the business use case. If 
possible the Evangelist may do this too, with the Director being the second 


line. 
And: 


6. Coordinate closely with data science to ensure the produce is being 


consumed as expected. 


7. Indirectly, to act as a check and balance on Data Science, acting as a 
supervision on the business results and output of data science that goes 
beyond purely quantitative statistics. 

8. Normal director level responsibilities, potentially some kind of Profit 
and Loss responsibility, KPIs, supplier and vendor relationships, 
reporting - example reporting relationship shown in Fig X.x., planning, 


hiring, firing, etc. 


Naturally the director can fill in for most any of the below roles as needed. 


Training Data Evangelist 


This role is an Educator, Trainer, and Change Agent. 
Primary responsibilities include: 


• Working closely with Line of Business managers to identify key
Training Data and AI Opportunities.

• Working “ahead” of Production managers, establishing the upcoming
work and acting as the glue between the line of business managers
and the production managers.


In a company focused exclusively on AI products:


• Educating people on the best usages of modern supervised learning
practice.


In a classic business:


• Educating people in the organization on the effects of AI
transformation. In practical terms, converting interest into actionable
annotation projects.

• Recruiting annotators from that line of business. On a practical level,
this would be about converting an employee doing regular work into
someone who, as say 20% of their job, is capturing their work in an
annotation system.

• Training. In the context of part-time annotators especially, this person
is responsible for explaining how to use tools and for troubleshooting
issues. This is distinct from the Production Managers, who are more geared
towards training full-time annotators. This is because you train a
Doctor differently than you train an entry level employee.


Training Data Production Manager(s) 


This person is chiefly concerned with being a taskmaster for actually 


getting annotation work completed. 


1. Interfacing with Data Science to set up the Schemas, setting up the tasks
and workflow UIs, and doing the admin management of training data
tooling (generally non-technical).

2. Training Annotators

3. Managing Day to Day Annotation Processes

4. In some cases basic data loading and unloading can also be done by this
person.

5. When in change management type discussions, this person is responsible
for explaining the reasonableness of annotation work to people new to the
matter.


6. Using Data Curation tooling. 


Annotation Producer 


Annotation producers usually fall within two buckets: 


1. Full time, dedicated and trained people. These may be newly hired people
or people reassigned from existing work.

2. Part time. It will increasingly become part of everyone’s job to a degree.


Data Engineering 


1. Responsible for getting the data loaded and unloaded, the technical 
aspects of training data tools, pipelines setup, pre-labeling, etc. 

2. Especially organizing getting data from various sources including 
internal teams 

3. Planning and architecting setup for new data elements 

4. Organizing integrations to understand the technical nuances of capturing 


Annotations 


Data Engineers would interface regularly with Data Science team(s) 


Historical Aside—a few reasons this wasn’t needed before: 


1. In classic machine learning the datasets already existed (even if messy),
so there was no need to “Produce” a dataset.

2. Earlier efforts were more separated from day to day business goals. This 
meant there was more of a rationale to do one time planning, one-off 
projects, isolated projects, etc. As AI transformation moves into the 


mainstream of your business this separation becomes an artificial barrier. 


Securing your AI Future 


Cliches about data being the new oil aside let’s think practically about this. 
If you invest $1 in training an employee, you get $1 of training for that 
single employee. If you invest $1 in capturing annotation work it will pay 


off many dollars (or be lost value) over time. How much? Let’s explore. 


Use Case Discovery 


How do we identify viable use cases? What is required and what is 
optional? In this section I provide a basic rubric to identify valid use cases. 


Then I expand into more context to help further identify good use cases. 
This section is organized from the most concrete to the most abstract. 


• Rubrics for good use cases

• Example use case compared against the rubric

• Conceptual effects, second order effects, and ongoing impacts


The simplified rubric may be a go-to in day to day work, while the rest can
act as the supporting knowledge. While you are welcome to use the rubric
exactly as it is, I encourage you to think of everything here as thought
starters, merely an introduction to thinking about use cases for training
data.
Rubric for Good Use Cases 


At the highest level view, a good use case must have a way to capture raw 


data, and at least one of: 


• Repeated often

• Involves experts

• Adds a new capability


The more of these things that are present, the greater the value of the
use case is likely to be.


The simplified rubric looks like: 


Question                    Result (with example answers)    Requirement

Can we get the raw data?    Yes/No                           Required

Is it repeated often?       Yes/Sometimes/No                 At least one of these
                                                             three is required
Involves an expert?         Yes/Sometimes/No

Adds a new capability?      Yes/Sometimes/No


That’s it! This can be a go-to reference rubric, and to support it when
needed you can use the more detailed Rubric below.
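
As a rough illustration, the simplified rubric can be written as a small
function. This is only a sketch: the names `Answer`, `UseCase`,
`optional_hits`, and `passes_rubric` are hypothetical, and counting
“Sometimes” toward the “at least one of” group is an assumption rather than
something the rubric prescribes.

```python
from dataclasses import dataclass
from enum import Enum


class Answer(Enum):
    YES = "Yes"
    SOMETIMES = "Sometimes"
    NO = "No"


@dataclass
class UseCase:
    name: str
    can_get_raw_data: bool        # the one required question
    repeated_often: Answer        # the next three form the "at least one of" group
    involves_expert: Answer
    adds_new_capability: Answer


def optional_hits(use_case: UseCase) -> int:
    """Count how many of the three optional questions are not a flat 'No'.

    Assumption: 'Sometimes' counts as a hit.
    """
    answers = [
        use_case.repeated_often,
        use_case.involves_expert,
        use_case.adds_new_capability,
    ]
    return sum(1 for answer in answers if answer is not Answer.NO)


def passes_rubric(use_case: UseCase) -> bool:
    """Raw data is required, plus at least one of the optional questions."""
    return use_case.can_get_raw_data and optional_hits(use_case) >= 1


background_removal = UseCase(
    name="Automatic background removal",
    can_get_raw_data=True,
    repeated_often=Answer.YES,
    involves_expert=Answer.SOMETIMES,
    adds_new_capability=Answer.SOMETIMES,
)
unrecorded_sale = UseCase(
    name="In-person sales interaction with nothing captured",
    can_get_raw_data=False,
    repeated_often=Answer.YES,
    involves_expert=Answer.NO,
    adds_new_capability=Answer.NO,
)

print(passes_rubric(background_removal), optional_hits(background_removal))
print(passes_rubric(unrecorded_sale))
```

The count from `optional_hits` mirrors the idea that the more of these
conditions are present, the greater the likely value. The second case fails
no matter how often the work repeats, because the raw data cannot be captured.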
Detailed Rubric 


Now that we have the general idea, let’s expand on it. Here I provide more
detailed questions, especially expanding and differentiating the new
capability concepts. I add examples, counterexamples, and some text on
why each test matters.


Test: Is the data already captured? Or is there a clear opportunity to add
more sensors to capture the data (in its entirety)? (Required)
Example: Existing documents (e.g. invoices), existing sensors, adding sensors
Counter example: Car dealer in-person sales interaction
Why it matters:
• Getting raw data capture is a required step.
• If you can’t get the raw data then the rest doesn’t matter!

Test: Does it involve experts?
Example: Doctor, engineer, lawyer, some specialists
Counter example: Grocery shopping
Why it matters:
• Makes a constrained resource available to more people (and potentially
more often and in more situations).
• Expert opinions are of high value.

Test: Is the work repeated often? Many times per minute? Daily? Hourly?
Weekly?
Example: Automatic background removal/blur, customer service and sales,
administrative document review
Why it matters:
• More readily available data, often already in digital form.
• Already will have a well understood pattern (at least by humans).
• Likely already is relatively well constrained.
• Often repeated tasks, viewed in aggregate, have a high value.
• Existing raw data may be already being captured.


Adds a new capability

Unlocks new use cases, beyond augmentation or replacement.

Test: Is the work done rarely because of expense? Would it be of great
value to increase the frequency?
Example: Inspections
Counter example: Stadium construction

Test: Can we turn an approximate process into a more exact one? Does this
process gloss over something because doing it in more depth is currently
impractical?
Example: Fruit ripeness, produce mold or bruising detection, dented can
detection
Counter example: Loan underwriting

Test: Are we currently substituting what we really want to figure out for
something more generic? Would improving this process’s accuracy lead to
more benefits than harm?
Example: An airport detection system that just detects metal, vs something
that can detect very specific threats

Test: Is there something that gets completely skipped because it would take
too long or is otherwise impractical (e.g. because of volume)? Would it be
of great value if we could do it?
Example: Analysis of video meetings and sales calls; porn detection in video
uploads; comment moderation

Test: Anything that is relatively time intensive (even if it happens rarely).
Example: Insurance property review*

* May take a while per house, but may only need to be done once per year
or decade.


The main thing that distinguishes this is that it’s something that otherwise
wouldn’t happen. For example, a bridge may currently be inspected annually,
but it would be impractical to inspect it daily. So an automatic bridge
inspection system will add a new capability. This is a good use case, even
though currently it is not repeated often (annually), and may or may not
involve expert labor directly. For example, the actual inspectors may be
looking for cracks and measuring them, while the engineering analysis is
still done by someone else. Either way it would be impractical for anyone to
inspect it every day.


Notes on Repeating use cases 


1. Don’t jump to assuming replacement - first think of augmentation.
2. Normal coding is fine for forms. Rather, think “what does a human do
when they get the information?”
3. To get a good idea, look at how many “repeat” roles there are in the
company. Are there thousands of people doing roughly the same thing?
That’s a great place to start.


A note on Specialists and Experts 


All work involves some degree of specialization and training. Instead of
offering “low hanging fruit” in specific cases, here are some of the areas
that often have the most opportunity (not necessarily the easiest). The
expert case is generally meant to signify something that is otherwise of
substantial difficulty to get. Of course what expert means will be up to you;
my own mental model is something along the lines of “a skill that takes
5-10 years, after normal education, to get to a basic level of proficiency,
or an area that is so cutting edge as to limit the pool of available people”.


Evaluating Use Case Against the Rubric 


Here we present an example use case in some detail and then compare it to
the rubric.

Automatic Background Removal


Have you recently done a video call and noticed someone’s background was
blurred? Or maybe you already use this feature yourself. Either way, most
likely you have already interacted with this Training Data powered product.

Specifically, when you take a video conference call (e.g. on Zoom), you
may be using the “background removal” feature (Figure 7-9). This turns a
messy, distracting background into a custom background image or smoothly
blurred background - seemingly magically.


Figure 7-9. Comparison of Normal Video to Video with Automatic Background Removal
(panels: Normal Video; Automatic Background Video)


For context, this used to require a green screen, custom lighting, and more.
For high quality productions (like movies), often a human has to manually
tune the settings.


So how is training data involved? 


First - we must be able to detect what is “foreground” and what is
“background”. We can take examples of video, and label the foreground
data as shown in Figure 7-10. We will use that to train a model to predict
the spatial location that’s “foreground”. The rest will be assumed to be
background - and blurred.
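
To make that last step concrete, here is a toy sketch of how a predicted
foreground mask could be used once a model exists. It assumes the model
output has already been turned into a 0/1 value per pixel. The tiny grayscale
“frame”, the 3x3 box blur, and the function names `box_blur` and
`blur_background` are illustrative assumptions, not a description of how any
particular video product works.

```python
from typing import List

Image = List[List[float]]   # grayscale pixel values, row-major
Mask = List[List[int]]      # 1 = foreground, 0 = background


def box_blur(image: Image) -> Image:
    """Blur by averaging each pixel with its in-bounds 3x3 neighbors."""
    height, width = len(image), len(image[0])
    blurred = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            neighbors = [
                image[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            blurred[y][x] = sum(neighbors) / len(neighbors)
    return blurred


def blur_background(image: Image, foreground: Mask) -> Image:
    """Keep foreground pixels as-is; replace everything else with the blur."""
    blurred = box_blur(image)
    return [
        [
            image[y][x] if foreground[y][x] == 1 else blurred[y][x]
            for x in range(len(image[0]))
        ]
        for y in range(len(image))
    ]


# A 3x3 "frame" with one bright foreground pixel in the middle.
frame = [
    [0.0, 0.0, 90.0],
    [0.0, 200.0, 0.0],
    [0.0, 0.0, 0.0],
]
# In practice this mask would come from the trained model's prediction.
mask = [
    [0, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]
result = blur_background(frame, mask)
```

The only part that needs training data is producing `mask`. Everything
after that is ordinary image processing, which is why the labeling effort
goes into marking the foreground.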


Figure 7-10. Example of labeling Foreground (label shown: Foreground)


The point here is that the model is figuring out what patterns make up the
“background”. We aren’t expressly having to declare what a messy pile of
clothes looks like.

If we constrain the problem to assume that only humans will be in view, we
could take an “off the shelf” model that detects humans and simply use that.


Why does a feature like this matter? 


• Creates more equality. Now it doesn’t matter if I have a fancy
background or not.
• Improves privacy and meeting efficiency. Helps reduce the impact of
disturbances (e.g. someone entering the edge of the calling area).

If this seems really simple - that’s the point. That’s the power of training
data. Something that used to be literally impossible without a green screen
becomes as easy as labeling videos.

Now a few gotchas to keep in mind:

• Getting a performant model to do “pixel segmentation” at time of
writing is still somewhat challenging.
• There are public datasets of humans that already do a fairly good job.
However, if you had to label a “Zoom calls” dataset from scratch it would
be a tremendous amount of work.


Evaluation

Question: Can we get the data?
Result: Yes, the video stream is already digitally captured.
Requirement: Required ✓

Question: Is it repeated often?
Result: Yes, a single video call may remove thousands of background frames,
a single user may have multiple calls per day, and there are many video
callers.

Question: Involve an expert?
Result: Sometimes. It takes some expertise to set up a green screen effect,
but it is far from expert medical or engineering knowledge.

Question: Adds a new capability?
Result: Sometimes. It was possible before to get a green screen, but in cases
like traveling, even a person who owns a green screen would be unable to use
it, so in that case it adds a new capability.


Overall this scores fairly well. It covers the requirement of being able to get
the data, and gets a huge yes in the “repeated” category. Plus, depending on
what subset of the use case you want to evaluate, it potentially avoids
having to have an expert and adds a new capability. For example, I can’t
imagine being able to easily get a custom background in the middle of a
cafe or airport without this.


A note on use cases: The conceptual effects area has a small list of use cases. For the
sake of this rubric and timing I dive deeply into only a single use case. My intent is to
convey the conditions under which any use case would be good, and provide tools to
think about the value add and overall effects. Given the breadth of potential use cases, I
believe this is much more valuable than attempting to enumerate all known use
cases; many such up-to-date lists can be found with online searches.


Conceptual Effects of Use Cases 


This is a fairly similar idea to above but from a slightly different
perspective. In the above section we were looking at this from the
perspective of “What is a good use case?”. Now I am looking at “What are
these use cases doing?”. I have also included some surface level, obvious
second order effects - effects caused by adoption of the technology outside
of the scope of the technology itself. I mean this as merely a starting point
to get people thinking about second and third order effects.


Concept: Relaxing constraints on the problem itself (on a prior solved
problem)
Examples:
• Green screen -> Any background
• I must be of a certain age to drive
• Spellcheck -> Grammar check
Second order effects:
• My background is not used to evaluate my candidacy for a job interview.
• “Taking the kids to x” takes on a new meaning if the parent isn’t driving.
• Expectations around correct grammar change (in addition to correct
spelling).

Concept: Replacing or augmenting routine work
Examples:
• Human counting sheep -> Automatic counting of sheep
• Human driving -> Car driving (parity with human performance)
• Human routing communication to a department -> Automatic
communication reporting based on intent (sales, support, etc.)
Second order effects:
• The meaning of work changes.
• Millions of jobs will be created and shifted. People will be required to
learn new skills.
• The suburbs may extend further out.
• A company that is not using AI effectively will have a worse cost
structure than one that is (e.g. same as not using digital effectively).

Concept: Making humans “superhuman”
Examples:
• Airport security scanning
• Sports analytics
• Self driving (accident reduction)
• Acting as a “second set of eyes” on routine medical work
Second order effects:
• Airport security may become more effective and faster (here’s hoping!).
• Intensity of sports may increase since expectations of elite levels of
coaching are extended to more people.
• Accidents may become more rare and even more newsworthy.

Concept: Making a constrained resource available to more people* (or
without as many limits)
Examples:
• Radiologist’s time. Previously, a radiologist could only see as many
people as there was time in the day; now a well set up medical system can
see a nearly unlimited number of patients. There are many asterisks to this,
but the general idea is there.
• This also removes the geographical limits.
• Self driving (more mobility because of lower cab fare due to shared
resource)
Second order effects:
• Medical care may become more accessible.
• The meaning of a “second opinion” may change.
• New dangers will appear, e.g. increased group think, data drift, decreased
weight on human expert opinion.

* This is similar to “Relaxing Constraints” but the examples are
fairly different, so I am keeping them in separate categories.


Ongoing Impact of Use Cases 


At its core, the idea is that training data is an easier way to encode human
knowledge into a machine.

The cost to “copy” human understanding approaches zero. While before a
radiologist’s time was a scarce resource, it will become abundant. Before, a
green screen was only in a film studio; now it’s on my smartphone
anywhere in the world. This comes with all of the advantages of the internet.

This has the following follow-on implications:

• It is possible to dramatically increase the frequency of items. For
example, previously a visual bridge inspection may only have been able to
happen once a year or once a decade. Now, an analysis of a similar level
could happen every few seconds.
• Processes that were previously “random” will become relatively “fixed”.
We all know car accidents happen. But eventually - they will be rare.
This turns a previously random process (“call me when you get home”) into
an all but sure thing (“International news: first car accident in the last 24
months happened”).
• Previously impossible things will become possible. For example, putting
a “dentist in your pocket”: eventually you will be able to point your
phone’s sensors at your mouth and get a level of insight that previously
would have required a dental visit.
• Personalization and effectiveness of “personal” assistance will increase.


Rethink AI Annotation Talent - quality over quantity

• Buying into the training data first, data centric first mindset is one thing
• Who is actually doing the annotation is a different decision
• Closely related
• The ideas aren’t actually controversial - it’s a matter of awareness (e.g.
8/10 agree once aware of it)
• Walk through logic. Can resonate.

Who is annotating and training data culture go hand in hand. The more
people who understand training data, the more opportunities will arise.
The more your own staff and experts are involved, the higher the quality.

Key Levers on Training Data ROI


• Talent, or who is annotating. To quote someone who manages a team
of 100 annotators: “The biggest determinant of annotation quality is the
person who created the annotation.”
• Degree of Training Data culture. This can be a step function type
delta. Either people are aware something can be turned into training data
or they aren’t.

These factors will ultimately determine the “shelf life” of training data, and
its ability to be leveraged into tangible productivity improvements.


Let’s think about what the Annotated Data Represents

1. Your business know-how, trade secrets, processes, competencies
2. The massive labor investment to create, update, protect, and maintain it
3. Your key to staying competitive during the AI transformation

Thinking about your annotation in that context further supports the need to
have a dedicated business unit focused on it. It also showcases the need for
awareness at all stages of the buying process. For example, if you are
buying a system to automate something, and the vendor is responsible for
the annotated data, what does that mean for your future?


Benefits of controlling your own training data 


1. A better cost model by using your existing team
2. Control over the economics and quality of the output
3. A better cost model by creating a shareable, reusable library of training
data


The Need for Hardware 


First, let’s get some sticker shock out of the way. Big, mature AI companies
spend tens to hundreds of millions of dollars or more on AI compute (e.g.
GPUs, ingestion, storage). This means that hardware cost is a key
consideration.

Second, training data is your new gold. It’s one of your most important
assets. People may be your most important asset, but this is a literal
embodiment of people. It’s important. How do you protect it?

Realistically, if you aren’t controlling the hardware, there is very little way
to protect that data. What happens if there is a contract dispute? What if the
vendor’s controls aren’t as good as you thought? You have your key
business data and records under your control - and training data must be the
same way. To be clear here - I don’t just mean, say, the annotation tooling.
Any data that any of your vendors are capturing that’s used for training AI
must be considered here.

This means that while a SaaS solution may be OK for getting started, proof
of concept, etc., for training data the hardware costs and the degree of
importance to the company are too great to not take control of it.

Practically speaking, if you leave any type of prediction, annotation
automation, etc. in the hands of a vendor’s server (unless it’s being run on
the client), this means you will actually be negotiating a huge amount of
hardware costs hidden inside the annotation tooling cost.


Common Project Mistakes 


Under resourcing is especially prevalent here. I have seen a number of solo
professionals, from pathology to dentistry, be curious about AI. While this
curiosity is naturally great, the reality is you need a strong team effort and
substantial resources to build something that will reach even a true
prototype level or production.

• Under resourced. A single doctor is unlikely to make a general purpose AI,
even for their niche area.
• Mistakes on volume of data needed. For example, a major dental office’s
x-rays for all time for all patients may be substantial, but on its own that
is still probably not enough data for a general purpose dentist AI.
• Most AI projects have a very long time horizon. It takes on the
order of months to years to build reasonable systems. And often the
expected lifetime and maintenance is measured in years or even longer.


Adopt Modern Training Data Tools 


Usage of Training Data tools. Effective usage can make orders of
magnitude of difference. The first step is to gain awareness of what high
level concepts exist. If you have read all the way through, you have already
taken a big step in this direction. Getting this book and other material in
front of your team is a great way to help accelerate this process.

Software can help provide guard rails and encouragement to a
transformation, but it is only a part of the overall transformation.

Training Data focused software is designed to scale to the breadth and depth
of complexity that’s needed to actually capture your business processes in
the fidelity required. There’s a big difference between tagging a whole
image and drawing a polygon with complex attributes that’s gone through a
dedicated review process.

Training data software has come a long way and has had millions of dollars
of investment since its humble origins. Modern training data software is
more in line with an office suite, with multiple complex applications
interacting together. While perhaps not quite as complex line for line,
directionally it will be in a few years’ time.


Business Models

There are two major business models that tools are sold under:

• Pay per user
• Pay per use


TODO expand on this 


Think Learning Curve not Perfection 


There is a tendency to seek out perfection and familiarity in training data
software, especially if an existing early team happens to be familiar with a
certain pattern. The simple truth is that all software has bugs. The other day
I used Google search and it duplicated the menus and search results. That’s a
product with an enormous amount of engineering effort behind it over
decades!

If we think back to early computer applications, they had terribly obscure
UIs. People had to learn many concepts to perform simple tasks. The same
general principle applies here.

The other truth to understand is that these applications and use cases are
continuing to increase in complexity. When I first started this, I could
provide a demo of most of the key functions in half an hour. Now, even if I
scope the demo to a specific persona (like an annotator), and a specific
media type (like image), it still may take half an hour! Complete end-to-end
coverage would take days... similar to how even a basic training course
for someone who’s never used a word processor or spreadsheet would take
days.

Slowing it down


While UI design and customization are important, over focus on them can miss
key points. If the SMEs are so busy that they can’t take some basic training
(or just time) to figure out how to use the application, then realistically, are
they going to provide quality annotations?


Customization and Configuration 


Further, while off the shelf software should absolutely be the starting point,
we have to recognize that there will always be the need for some degree of
customization and configuration.


New Training and Knowledge are Required

Everyone

1. The introduction: A high level overview of what Supervised AI is and
how it relates to your specific business.
2. Reassurance: That it will increase productivity - Proposals and
Corrections - not replace their job.
3. The concept: It’s supervision leading to more productive work.
4. The ask: To bring up ideas of processes that are ripe for this type of
supervision.

Annotators

1. The basics of annotation tooling. To continue the office analogy,
knowing how to use annotation tooling will be the modern equivalent of
learning word processing.
2. More in depth training, such as in part reading this book, and further
training on sensitive issues like bias.

Managers

All of the above and:

1. What questions to ask regarding new and updated processes
2. How to identify financially viable training data opportunities
3. To reflect on productivity goals in this new world of AI annotation.
Every moment of work done that’s not captured in annotation is a
moment lost.

Executives

1. To reflect on the company organization, such as creation of a new
Training Data unit
2. To germinate, nurture, and guard the culture around training
3. To carefully consider vendor choices in relation to securing future AI
goals


Producing And Consuming Training Data 


One of the most confusing aspects of this is: “Who is making the software
that consumes training data?” “Who is producing the training data?”. Here
are the big buckets I have seen:

1. A software focused company produces an AI powered product, and the
bulk or all of the training data is also produced by them. The firm
releases the software to consumers, or another firm buys the software and
is the end user.
2. A company with in-house training data production capabilities creates
the software for its own internal use. Often this may involve leaning on
external partners or very substantial investment.
3. A software firm produces an AI powered product, but leaves the bulk of
the training data to be produced by the end user buying the software.

The only case where the end user company is not involved in the training
data production is case 1. In general, that either shifts the core competence
of the business to that software provider, or it means there is a relatively
static product being provided. An analogy here would be buying a “website”
that can’t be updated. Since few people want a website they can’t update, in
general, the trend will likely be towards the company always being able to
produce its own training data in some way.

From the executive viewpoint, in a sense this is really the key
question: “Do you want to produce your own training data?”.


Trap to Avoid: Premature Optimization in Training Data

Trap: Thinking a Trained Model is the end goal
How it happens:
1. Effort is taken to train a model.
2. It sort of works and people get excited.
3. Assume it just needs some small adjustments.
4. Realize it’s far from done.
Warning signs:
• The “Trained Model” is discussed as the end goal.
• Ongoing continuous annotation is not discussed.
• Iteration is talked about, but in the context of a limited time window.
Avoiding it:
• Educate people that the goal is to set up a continual improvement system,
not a single one-off model.
• Discuss upfront what performance level is good enough to ship version 1.
For example, in self driving some have taken the approach of “equal to
human is good enough”.*

Trap: Committing to a Schema too early
How it happens:
1. Use up a lot of resources annotating.
2. Realize the labels, attributes, overall Schema etc. do not match their
needs. E.g. was using bounding boxes and realize need keypoints.
Warning signs:
• Schema does not change significantly between early pilot work and more
major work.
• Schema was determined with minimal data science involvement.
• “Final” Schema was determined without proof that achieving success will
solve downstream problems.
Avoiding it:
• Expect the Schema to change.
• Try many different Schema approaches with actual models to see what
actually works - don’t assume anyone has the right background to know the
answer in advance.
• Ask: If the model makes perfect predictions, does this actually solve our
downstream use case? E.g. if it perfectly predicts the box, will that solve
the overall problem?

Trap: Committing to Automations too early
How it happens:
1. Look at how much resources human annotation may take.
2. Look for automation solutions.
3. At first are happy with automation results.
4. Realize the automation isn’t quite doing what they thought it was doing.
Warning signs:
• Unrealistic expectations. Automation expected to virtually solve
annotation.
• Automation explored without involving data science.
• Automation plans being discussed in detail before any significant human
annotation work has been done.
• A reduction in management mindshare in training data under the false
assumption that automations handle it.
Avoiding it:
• Realize it doesn’t take much human annotation to start getting a
directional understanding of needs.
• Use minimal or even zero automation until you have done lots of manual
human annotation.
• Do more human annotation than you think you need. After, you will be in
the best position to choose the most effective automations.
• Think of automation as an expected part of the process. Not a silver
bullet.

Trap: Wrongly calculating volume of work
How it happens:
1. Look at overall dataset size.
2. Project how many annotations.
3. Assume will need all of them.
Warning signs:
• Assuming all available data needs to be annotated (data can be filtered
for most valuable items).
• Not considering ongoing accumulation of data or production data.
• Not considering diminishing returns. E.g. that every further annotated
item adds incrementally less value than the prior item.
• Overfocus on “getting a data set annotated” instead of what the actual
result is going to be.
Avoiding it:
• Get a large enough sample of actual work to understand how long each
sample usually takes.
• Realize that it’s always going to be a moving target. For example, the
work per sample may get harder as the models get better.

Trap: Not spending enough time with tooling. Not getting the right tools
Warning signs:
• Unrealistic expectations.
• Treating it more like a fly by shopping website than a serious new suite
of productivity tools.
Avoiding it:
• Realize that new platforms are like Photoshop walks into a bar and meets
a gruff Taskmaster who moonlights as a Data Engineer. Complex and new.
• The more powerful the tool, the more the need to understand it.
• Treat it more like learning a new subject area, a new Art.

* I’m not commenting if this approach is right or wrong.

No Silver Bullets


There are no silver bullets to Annotation. Annotation is work. For that work
to have value it must be literally labored at. All of the means to improve the
productivity of annotation must have some base that exists.

Training data must be relevant to your business use case. To do this, it
needs the insights from your employees. Everything else is noise, or in
depth, expert, situationally specific concepts.

Instead of looking at some of these optimizations as true gains, rather see
missing out on them as being functionally illiterate.

This caution is important because, being such a new area, the norms are
poorly established.


Culture of Training Data 


One of the biggest misconceptions about modern training data is that it is
exclusively the realm of data science. This misconception really holds
teams back.

True data science is very hard work. However, the context in which that
hard work must be employed is often confused. If the AI projects are left as
“that’s the data science team’s responsibility”, then how likely do you
think company-wide success will be?

It’s very clear that:

1. Non-data science expertise, in the form of subject matter knowledge, is
the core of training data.
2. The ratio of people who are SMEs to data scientists is that of vast fields
to a single grain.
3. Data science work is increasingly becoming automatic and integrated
into applications.

The great thing here is that most data scientists would be happy if other
people knew more about training data. As a data scientist, I don’t want to
worry about the training data for the most part; I have enough of my own
concerns.

Not everything, at least right away, is a good candidate for training data.
Being able to recognize what will likely work and what won’t work is a key
part of this culture. The more bottom-up the ideas for this, the less likely it
will be something that appears easy but is actually terribly hard and fails.

Getting everyone involved with training data is at the heart of AI
transformation. In the same way that the IT team doesn’t magically recreate
every business process on its own, each line of business manager brings a
growing awareness of the capabilities of digital tools, what questions to ask,
etc.


New Engineering Principles 


The first step is to recognize the intersection of the business need, and the
expert annotators, with the day to day concerns about data.

Creating Embedded Annotation Experiences

Creating annotation experiences embedded into existing and new
applications. User interfaces, data extraction to training data systems,
receiving predictions, etc.

Making the Training Data system a Standard central component

To run a web server we use standard web server tech. It’s the same with
training data. As the complexity and investment in this space continue
to grow, it makes more and more sense to shift that training data into a
dedicated system, or set of systems.

Abstracting the Training Data from Data Science

A simple break point here is dataset creation. Instead of a dataset being
something that’s static, if we allow Training Data to be created of its
own accord, and data science to query it, we get a much more powerful
separation of responsibility. Of course in practice there will be some
interplay and communication, but directionally that provides a much
cleaner scope of responsibilities.
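
One way to picture this separation is a store that annotators keep adding
to, with a dataset defined as a query against it rather than as a frozen
export. The sketch below is hypothetical: `TrainingDataStore`, its `add` and
`query` methods, and the `Annotation` fields are invented for illustration
and do not correspond to any specific product’s API.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass(frozen=True)
class Annotation:
    file_id: str
    label: str
    reviewed: bool


@dataclass
class TrainingDataStore:
    """Owned by the training data side; it keeps growing over time."""
    _records: List[Annotation] = field(default_factory=list)

    def add(self, annotation: Annotation) -> None:
        self._records.append(annotation)

    def query(self, predicate: Callable[[Annotation], bool]) -> List[Annotation]:
        """Data science asks for a slice; the result reflects the store right now."""
        return [record for record in self._records if predicate(record)]


def reviewed_foreground(record: Annotation) -> bool:
    """The 'dataset definition': a rule, not a fixed list of files."""
    return record.label == "foreground" and record.reviewed


store = TrainingDataStore()
store.add(Annotation("video_001", "foreground", reviewed=True))
store.add(Annotation("video_002", "foreground", reviewed=False))

first_pull = store.query(reviewed_foreground)

# Annotation work continues independently of any training run.
store.add(Annotation("video_003", "foreground", reviewed=True))

second_pull = store.query(reviewed_foreground)
print(len(first_pull), len(second_pull))
```

The same query returns more data the second time without anyone rebuilding
a dataset by hand, which is the division of responsibility described above.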


As a quick note on outsourcing, even if a percentage of the annotators are
outsourced, there is still such a substantial budget involved that it’s
probably better to think of them as part time or even full time equivalents.

Technically this is the “higher dimensional space”, which we are unable
to reasonably represent visually. Many machine learning models have
hundreds of dimensions, while we can only reasonably graph 4 (space
(x, y, z) and time (t)).